ChatGPT, which is "surging" in popularity, urgently needs "compliance brakes"

律动BlockBeats · 2023-02-06T14:05:01.000Z

Original title: "ChatGPT, whose popularity is 'soaring', urgently needs a 'compliance brake'"
Original author: Xiao Sa's legal team

Core tip: ChatGPT and other chat AIs based on natural language processing technology have three main legal compliance issues that urgently need to be resolved in the short term.

First, the intellectual property issues raised by the responses that chat AI provides. The central compliance question is whether those responses give rise to intellectual property rights of their own, and whether intellectual property authorization is required.

Second, whether the data mining and training that chat AI performs on huge volumes of natural language text (generally called a corpus) requires intellectual property authorization.

Third, one of the mechanisms behind chat AI such as ChatGPT is to run mathematical statistics over a large body of existing natural language text in order to obtain a statistics-based language model. This mechanism makes chat AI prone to "talking nonsense with a straight face", which in turn creates the legal risk of spreading false information. Given this technical background, how can the risk of false information spread by chat AI be reduced as far as possible?

Generally speaking, China's artificial intelligence legislation is still at the pre-research stage: there is no formal legislative plan or draft bill, and the relevant departments are particularly cautious about regulating the field. As artificial intelligence develops, the corresponding legal compliance problems will only multiply.

<img src="https://image.theblockbeats.info/upload/2023-02-06/a1cadccfebe799969a48aaf86ed988cee902194d.png?x-oss-process=image/quality,q_50/format,webp" />

1. ChatGPT is not an "epoch-making artificial intelligence technology"

ChatGPT is essentially a product of the development of natural language processing technology, and at bottom it is still just a language model.

At the beginning of 2023, a huge investment from the global technology giant Microsoft turned ChatGPT into the "top star" of the technology field and brought it to mainstream attention. As ChatGPT concept stocks surged in the capital market, many domestic technology companies also began to position themselves in this field. While the capital market is enthusiastic about the ChatGPT concept, we as legal practitioners cannot help asking what legal and security risks ChatGPT itself may bring, and what its path to legal compliance looks like.

Before discussing the legal risks and compliance paths of ChatGPT, we should first examine its technical principles: can ChatGPT, as the news suggests, give a questioner whatever answer they are looking for?

In the view of Sister Sa's team, ChatGPT is far less "magical" than some news reports have claimed. In one sentence, it is an integration of natural language processing technologies such as Transformer and GPT, and it is still essentially a language model based on neural networks rather than an "epoch-making AI advance".

As mentioned above, ChatGPT is a product of the development of natural language processing technology. Historically, this technology has gone through roughly three stages: grammar-based language models, statistics-based language models, and neural network-based language models. ChatGPT belongs to the neural network-based stage. To understand more directly how ChatGPT works and the legal risks this working principle may cause, one must first understand how its predecessor, the statistics-based language model, works.
In the statistics-based language model stage, AI engineers compile statistics over huge volumes of natural language text to determine the probability that one word follows another. When a person asks a question, the AI analyzes which word combinations are highly probable in the linguistic context formed by the words of the question, and then splices these high-probability words together to return a statistics-based answer. This principle has run through the development of natural language processing technology ever since it emerged; in a sense, the later neural network-based language models are also refinements of the statistics-based language model.

To give an easy-to-understand example, Sister Sa's team entered the question "What tourist attractions are there in Dalian?" in the ChatGPT chat box, as shown below:

<img src="https://image.theblockbeats.info/upload/2023-02-06/7da26df7a83baf297c0bead1c28857afaa2a6fda.png?x-oss-process=image/quality,q_50/format,webp" />

In the first step, the AI analyzes the basic morphemes in the question ("Dalian", "which", "tourism", "attractions"). It then finds the set of natural language texts in the existing corpus in which these morphemes appear, finds the collocations with the highest probability of occurrence in that set, and combines them to form the final answer. For example, the AI finds that "Zhongshan Park" appears in the texts where the three words "Dalian", "tourism" and "attractions" occur with high probability, so it returns "Zhongshan Park".
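
As a rough illustration of the counting-and-splicing idea described above (not ChatGPT's actual implementation, which is a neural network), the following Python sketch builds a toy bigram model. The corpus sentences and the function names `train_bigrams`, `next_word_probabilities` and `generate` are all invented for this example.

```python
from collections import Counter, defaultdict

# Toy corpus: these sentences are invented purely for illustration.
CORPUS = [
    "dalian has zhongshan park",
    "zhongshan park has beautiful gardens",
    "the park has beautiful gardens and lakes",
    "the park has beautiful fountains",
]


def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in followers.items()}


def generate(counts, start, max_words=6):
    """Greedily splice together the most frequent next word at each step."""
    words = [start]
    while len(words) < max_words:
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation for the last word
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


if __name__ == "__main__":
    counts = train_bigrams(CORPUS)
    print(next_word_probabilities(counts, "has"))
    print(generate(counts, "dalian"))
```

On this toy corpus, generation starting from "dalian" yields "dalian has beautiful gardens and lakes". No sentence in the corpus says this; it is simply the most frequent continuation at each step, so a fluent output can be unsupported by any source text.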
Similarly, because the word "park" is most likely to be paired with words such as "gardens", "lakes", "fountains" and "statues", the AI goes on to return "This is a historic park with beautiful gardens, lakes, fountains, and statues."

In other words, the whole process rests on probability statistics over the existing natural language text (the corpus) behind the AI, so the answers it returns are also "statistical results". This is why ChatGPT "talks nonsense with a straight face" on many questions. Take the answer to "What tourist attractions are there in Dalian?": Dalian does have a Zhongshan Park, but that park has no lakes, fountains or statues. Dalian did historically have a "Stalin Square", but Stalin Square was never a commercial square and never had any shopping malls, restaurants or entertainment venues. Clearly, the information returned by ChatGPT is false.

<img src="https://image.theblockbeats.info/upload/2023-02-06/75ad76d69613520df24f72ec636181ba4843dbef.png?x-oss-process=image/quality,q_50/format,webp" />

2. The most suitable application scenarios for ChatGPT as a language model at present

Although the previous section bluntly set out the drawbacks of statistics-based language models, ChatGPT is, after all, a neural network-based language model that greatly improves on them. Its technical foundations, Transformer and GPT, both belong to the latest generation of language models. In essence, ChatGPT combines massive amounts of data with the highly expressive Transformer model to model natural language in great depth. Although the sentences it returns are sometimes "nonsense", at first glance they still look like "human replies", so the technology has broad application in scenarios that require large-scale human-computer interaction.
At present, there are three such scenarios: first, search engines; second, human-computer interaction mechanisms in banks, law firms, intermediary agencies of all kinds, shopping malls, hospitals and government service platforms, such as customer complaint systems, medical guidance and navigation, and government consultation systems in those places; third, the interaction mechanisms of smart cars and smart homes (such as smart speakers and smart lights).

Search engines that incorporate AI chat technology such as ChatGPT are likely to follow a path in which traditional search remains the mainstay and the neural network-based language model plays a supporting role. Traditional search giants such as Google and Baidu have accumulated deep expertise in neural network-based language models; Google, for example, has Sparrow and LaMDA, which are comparable to ChatGPT. With the support of these language models, search engines will become more "human".

Applying AI chat technology such as ChatGPT to customer complaint systems, guidance and navigation in hospitals and shopping malls, and the consultation systems of government agencies will significantly reduce the human resource costs of the organizations concerned and save communication time. The problem, however, is that statistics-based answers may turn out to be completely wrong, and the resulting risks may require further risk-control assessment.

Compared with the two scenarios above, using ChatGPT as the human-computer interaction mechanism for devices such as smart cars and smart homes carries much lower legal risk: the application environment is relatively private, so erroneous content returned by the AI is unlikely to cause major legal risk. Such scenarios also do not demand a high degree of content accuracy, and their business models are more mature.
<img src="https://image.theblockbeats.info/upload/2023-02-06/5cbd33536783a11ed6940485aee55f4d1a1706b7.png?x-oss-process=image/quality,q_50/format,webp" />

3. A preliminary exploration of ChatGPT's legal risks and compliance paths

First, the overall regulatory landscape of artificial intelligence in China

Like many emerging technologies, the natural language processing technology represented by ChatGPT faces the "Collingridge dilemma". This dilemma has two parts: the information dilemma and the control dilemma. The information dilemma means that the social consequences of an emerging technology cannot be foreseen in its early stages; the control dilemma means that by the time the adverse social consequences are discovered, the technology has often become part of the overall social and economic structure, so that those consequences can no longer be effectively controlled.

At present, artificial intelligence, and natural language processing in particular, is developing rapidly. The technology may well fall into the so-called "Collingridge dilemma", and the corresponding legal regulation does not seem to have "kept pace". China currently has no national-level legislation on the artificial intelligence industry, but there have been legislative attempts at the local level. Just last September, Shenzhen announced the country's first special legislation on the artificial intelligence industry, the "Shenzhen Special Economic Zone Artificial Intelligence Industry Promotion Regulations", and Shanghai subsequently passed the "Shanghai Regulations on Promoting the Development of the Artificial Intelligence Industry".
It is expected that other localities will soon introduce similar legislation for the artificial intelligence industry.

In terms of the ethical regulation of artificial intelligence, the National New Generation Artificial Intelligence Governance Professional Committee released the "New Generation Artificial Intelligence Ethics Code" in 2021, proposing that ethics be integrated into the full life cycle of artificial intelligence research, development and application. Perhaps in the near future, something like the "Three Laws of Robotics" in Asimov's novels will become the iron law of artificial intelligence regulation.

Second, the legal risks of false information brought about by ChatGPT

Turning from the macro to the micro: leaving aside the overall regulatory landscape of the artificial intelligence industry and the ethical regulation of artificial intelligence, the practical compliance problems at the foundation of chat AI such as ChatGPT also need urgent attention.

The thorniest of these is the false information in ChatGPT's replies. As discussed earlier in this article, the way ChatGPT works means that its replies may be complete "nonsense delivered with a straight face". False information of this kind, which looks plausible but is in fact absurd, is extremely misleading. Of course, a false answer to a question like "What tourist attractions are there in Dalian?" may not have serious consequences, but if ChatGPT is applied to search engines, customer complaint systems and similar fields, the false information it returns could create very serious legal risks.

In fact, such legal risks have already materialized. Galactica, Meta's language model for scientific research, went online in November 2022 at almost the same time as ChatGPT, and was taken offline after only 3 days of testing following user complaints about answers that mixed truth and falsehood.
Given that the underlying technical principles cannot be changed in the short term, ChatGPT and similar language models must be adapted for compliance before they are applied to search engines, customer complaint systems and similar fields. When the system detects that a user may be asking a professional question, it should direct the user to consult the appropriate professional rather than seek an answer from artificial intelligence, and it should clearly remind the user that the accuracy of the answers returned by the chat AI may need further verification, so as to minimize the corresponding compliance risk.

Third, the intellectual property compliance issues brought about by ChatGPT

Turning from the macro to the micro, in addition to the truthfulness of AI replies, the intellectual property issues of chat AI, and especially of large language models like ChatGPT, also deserve the attention of compliance personnel.

The first compliance issue is whether "text data mining" requires intellectual property authorization. As noted above, ChatGPT's working principle relies on a huge volume of natural language text (also called a corpus). ChatGPT needs to mine the data in the corpus and train on it, which involves copying the corpus content into its own database; in natural language processing this is usually called "text data mining". Where the text data may constitute a work, it remains controversial whether text data mining infringes the right of reproduction.

In comparative law, both Japan and the European Union have expanded the scope of fair use in their copyright legislation, adding "text data mining" for AI as a new fair use situation. Although some scholars argued during the 2020 revision of China's Copyright Law that the fair use system should be changed from "closed" to "open", this proposal was ultimately not adopted.
At present, China's Copyright Law still defines fair use in a closed manner: only the thirteen situations set out in Article 24 of the Copyright Law can be recognized as fair use. In other words, the Copyright Law does not currently bring "text data mining" for AI within the scope of fair use, and text data mining in China still requires the corresponding intellectual property authorization.

The second compliance issue is whether the responses generated by ChatGPT are original. On the question of whether AI-generated works are original, Sister Sa's team believes that the criteria should be no different from the existing ones. In other words, whether a response was produced by AI or by a human, its originality should be judged against existing standards. Behind this question lies another, more controversial one: if an AI-generated response is original, can the copyright holder be the AI? Clearly, under the intellectual property laws of most countries, including China, the author of a work can only be a natural person, and AI cannot be the author.

Finally, if ChatGPT splices third-party works into its replies, how should the intellectual property issues be handled? Sister Sa's team believes that if a ChatGPT reply incorporates copyrighted works from the corpus (although, given how ChatGPT works, the probability of this is small), then under China's current Copyright Law the work may not be reproduced without the copyright holder's authorization unless the use constitutes fair use.

<img src="https://image.theblockbeats.info/upload/2023-02-06/5b0c02914434e5d980340ee822aa513fd3f940d6.png?x-oss-process=image/quality,q_50/format,webp" />
