Original title: "ChatGPT, whose popularity is soaring, urgently needs 'compliance brakes'"

Original author: Xiao Sa's legal team

Core reminder: ChatGPT and other chat AI based on natural language processing technology face three main legal compliance issues that urgently need to be resolved in the short term.

First, the intellectual property status of the replies that chat AI provides. The central compliance questions are whether those replies give rise to intellectual property rights, and whether intellectual property authorization is required to use them.

Second, whether the data mining and training that chat AI performs on huge volumes of natural language text (generally called corpora) requires corresponding intellectual property authorization.

Third, chat AI such as ChatGPT works in part by statistically analyzing a large body of existing natural language text to obtain a language model. This mechanism makes it very likely that chat AI will "talk nonsense in all seriousness", which creates a legal risk of spreading false information. Given this technical background, how can that risk be minimized?

In general, China's legislation on artificial intelligence is still in the preliminary research stage. There is no formal legislative plan or draft bill, and the relevant departments are particularly cautious about regulating the field. As artificial intelligence develops, the corresponding legal compliance problems will only increase.

1. ChatGPT is not an "epoch-making artificial intelligence technology"

ChatGPT is a product of the development of natural language processing technology and is, in essence, still just a language model.
At the beginning of 2023, a huge investment by Microsoft, a global technology giant, made ChatGPT one of the hottest topics in technology and carried it into the mainstream. As the ChatGPT concept sector surged in the capital market, many Chinese technology companies also began to position themselves in this field. While the capital market is enthusiastic about the ChatGPT concept, we as legal practitioners cannot help asking what legal and security risks ChatGPT itself may bring, and what its compliance path looks like.

Before discussing the legal risks and compliance paths of ChatGPT, we should first examine its technical principles: can ChatGPT, as the news reports say, give the questioner whatever answer he or she wants?

In the eyes of Sister Sa's team, ChatGPT is far less "magical" than some news reports have claimed. In a word, it is an integration of natural language processing technologies such as Transformer and GPT. In essence, it is still a language model based on neural networks, rather than an "epoch-making advance in AI".

As mentioned earlier, ChatGPT is a product of the development of natural language processing technology. Historically, this technology has gone through roughly three stages: grammar-based language models, statistics-based language models, and neural network-based language models. ChatGPT belongs to the neural network stage. To understand more directly how ChatGPT works, and the legal risks that its working principle may cause, we must first explain the working principle of the statistics-based language model, which is the predecessor of the neural network-based language model.

In the statistical language model stage, AI engineers determine the probability that one word follows another by counting over a huge amount of natural language text.
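The counting step just described can be sketched as a toy bigram model. This is only an illustration under stated assumptions: the four-sentence corpus and the function names `train_bigrams`, `next_word_probs` and `generate` are hypothetical, and ChatGPT itself is a neural network-based model, not a count table like this one.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is directly followed by each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def next_word_probs(follows, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = follows.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def generate(follows, start, max_words=8):
    """Greedily chain the most frequent follow-up word, starting at `start`."""
    words = [start]
    for _ in range(max_words - 1):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # no observed continuation for the last word
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus; no sentence in it says that Dalian has a lake.
corpus = [
    "the park has a lake",
    "the park has a fountain",
    "the park has a lake",
    "dalian has a park",
]
model = train_bigrams(corpus)

print(next_word_probs(model, "a"))
print(generate(model, "dalian", max_words=5))
```

With this toy corpus the generated sentence is "dalian has a lake": every word-to-word step is statistically well supported, yet no sentence in the corpus makes that claim. A fluent but unsupported statement of this kind is what the article calls "serious nonsense".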
When a person asks a question, the AI analyzes which words are most likely to co-occur in the language context formed by the words of the question, and then splices these high-probability words together to return a statistically derived answer. This principle has run through natural language processing technology ever since it emerged. In a sense, the neural network-based language models that appeared later are refinements of the statistics-based language model.

To give an easy-to-understand example, Sister Sa's team entered the question "What tourist attractions are there in Dalian?" in the ChatGPT chat box.

[Figure: screenshot of ChatGPT's reply to the question]

The AI first breaks the question down into its basic morphemes, "Dalian, which, tourism, attractions". It then finds the set of natural language texts in its existing corpus that contain these morphemes, identifies the most probable collocations in that set, and combines these collocations into the final answer. For example, the AI finds that "Zhongshan Park" appears with high probability in texts containing the three words "Dalian, tourism, attractions", so it returns "Zhongshan Park". Likewise, the word "park" is most likely to be matched with words such as gardens, lakes, fountains and statues, so it goes on to return "This is a park with a long history, with beautiful gardens, lakes, fountains, and statues."

In other words, the whole process rests on probability statistics over the existing natural language text (the corpus) behind the AI, so the answers returned are also "statistical results". This is what leads ChatGPT to "talk nonsense in all seriousness" on many questions. In the answer to the question about Dalian's tourist attractions, for example, Dalian does have a Zhongshan Park, but that park has no lakes, fountains or statues.
Dalian did once have a "Stalin Square", but Stalin Square was never a commercial square, and it had no shopping malls, restaurants or entertainment venues. Obviously, the information returned by ChatGPT is false.

2. The application scenarios that currently suit language models such as ChatGPT best

Although the previous section explained the drawbacks of statistical language models, ChatGPT is, after all, a neural network-based language model that greatly improves on them. Its technical foundations, Transformer and GPT, both belong to the latest generation of language models. ChatGPT essentially combines massive amounts of data with a highly expressive Transformer model, and thereby models natural language in great depth. Although the sentences it returns are sometimes "nonsense", they still look like "human replies" at first glance. This technology is therefore widely applicable wherever large volumes of human-computer interaction are required.

At present, there are three such scenarios: first, search engines; second, human-computer interaction in banks, law firms, intermediary agencies, shopping malls, hospitals and government service platforms, such as their customer complaint systems, medical guidance and navigation, and government consultation systems; third, interaction for smart cars and smart homes (such as smart speakers and smart lights).

Search engines that incorporate AI chat technologies such as ChatGPT are likely to keep the traditional search engine as the core and use the neural network-based language model as a supplement. Traditional search giants such as Google and Baidu already have deep expertise in neural network-based language models. For example, Google has Sparrow and LaMDA, which are comparable to ChatGPT.
With the support of these language models, search engines will become more "human-like".

Using AI chat technologies such as ChatGPT in customer complaint systems, hospital and shopping mall guidance, and government consultation systems will greatly reduce the labor costs of the organizations concerned and save communication time. The problem is that statistically derived answers may be completely wrong, and the resulting risks need further evaluation.

Compared with the two scenarios above, the legal risk of using ChatGPT as the human-computer interaction mechanism for smart cars, smart homes and similar devices is much smaller. The application environment in these fields is relatively private, so erroneous AI feedback will not cause major legal risks. These scenarios also do not demand high content accuracy, and their business models are more mature.

3. A preliminary exploration of the legal risks and compliance paths of ChatGPT

First, the overall regulatory landscape of artificial intelligence in China. Like many emerging technologies, the natural language processing technology represented by ChatGPT faces the "Collingridge dilemma", which consists of an information dilemma and a control dilemma. The information dilemma means that the social consequences of an emerging technology cannot be predicted in its early stage. The control dilemma means that by the time the adverse social consequences of an emerging technology are discovered, the technology has often become part of the social and economic structure, so those consequences can no longer be controlled effectively.

At present, the field of artificial intelligence, and natural language processing technology in particular, is developing rapidly.
This technology is likely to fall into the "Collingridge dilemma", and legal supervision does not seem to be keeping pace. China currently has no national legislation on the artificial intelligence industry, but there have been legislative attempts at the local level. In September 2022, Shenzhen announced the "Regulations on Promoting the Artificial Intelligence Industry in Shenzhen Special Economic Zone", the country's first legislation dedicated to the artificial intelligence industry, and Shanghai then passed the "Regulations on Promoting the Development of the Artificial Intelligence Industry in Shanghai". Similar legislation can be expected in other localities soon.

As for the ethical regulation of artificial intelligence, the National New Generation Artificial Intelligence Governance Professional Committee issued the "Ethical Code for New Generation Artificial Intelligence" in 2021, proposing to integrate ethics into the entire life cycle of artificial intelligence research, development and application. Perhaps in the near future, something like the "Three Laws of Robotics" from Asimov's novels will become the iron law of supervision in the field of artificial intelligence.

Second, the legal risk of false information brought by ChatGPT. Shifting the focus from the macro to the micro, and setting aside the overall regulatory landscape and the ethical regulation of artificial intelligence, the practical compliance issues of chat AI such as ChatGPT also need urgent attention.

Among them, the thornier one is the false information in ChatGPT's replies. As explained in the first part of this article, the working principle of ChatGPT means that its replies may well be "nonsense delivered in all seriousness".
False information of this kind, which seems true but is actually absurd, is extremely misleading. Of course, false replies to questions such as "What are the tourist attractions in Dalian?" may not cause serious consequences. But if ChatGPT is applied to search engines, customer complaint systems and similar fields, the false information in its replies may create extremely serious legal risks.

In fact, such legal risks have already appeared. Galactica, Meta's language model for scientific research, was launched in November 2022, at almost the same time as ChatGPT. It drew user complaints for mixing true and false answers and was taken offline after only three days of testing. Since the technical principles cannot be overcome in the short term, ChatGPT and similar language models must be adapted for compliance before they are applied to search engines, customer complaint systems and similar fields. When the system detects that a user may be asking a professional question, it should guide the user to consult the relevant professionals rather than look to artificial intelligence for answers. It should also clearly remind users that the accuracy of the answers returned by chat AI may need further verification, so as to minimize the corresponding compliance risks.

Third, the intellectual property compliance issues brought by ChatGPT. When turning the focus from the macro to the micro, in addition to the accuracy of AI replies, the intellectual property issues of chat AI, especially large language models like ChatGPT, also deserve the attention of compliance personnel.

The first compliance problem is whether "text data mining" requires corresponding intellectual property authorization. As mentioned above, ChatGPT relies on a huge amount of natural language text (a corpus). ChatGPT needs to mine and train on the data in the corpus.
ChatGPT needs to copy the content of the corpus into its own database. In the field of natural language processing, this behavior is usually called "text data mining". When the text data concerned may constitute a work, whether text data mining infringes the right of reproduction is still controversial.

In comparative law, Japan and the European Union have expanded the scope of fair use in their copyright legislation, adding "text data mining" in AI as a new fair use situation. Some scholars advocated changing China's fair use system from "closed" to "open" during the 2020 revision of the Copyright Law, but this proposal was not adopted in the end. China's Copyright Law still maintains a closed fair use system: only the thirteen situations stipulated in Article 24 of the Copyright Law can be recognized as fair use. In other words, China's Copyright Law does not currently include "text data mining" in AI within the scope of fair use, and text data mining in China still requires corresponding intellectual property authorization.

The second compliance problem is whether the responses generated by ChatGPT are original. On the question of whether AI-generated works are original, Sister Sa's team believes that the criteria should be no different from the existing ones. In other words, whether a response is completed by AI or by a human, it should be judged against the existing originality standards. Behind this question lies another, more controversial one: if a reply generated by AI is original, can the AI be the copyright owner? Under the intellectual property laws of most countries, including China, the author of a work can only be a natural person, and AI cannot be the author of a work.
Finally, if ChatGPT splices third-party works into its replies, how should the intellectual property issues be handled? Sister Sa's team believes that if ChatGPT splices copyrighted works from the corpus into its replies (although, given how ChatGPT works, the probability of this happening is small), then under China's current Copyright Law, unless the use constitutes fair use, authorization must be obtained from the copyright owner before copying.
