Training AI models requires huge amounts of data, and a model's ability to produce good results depends directly on what it has been fed. That information does not come free of charge, and much of it is covered by intellectual property rights.
But AI firms don't think along these lines: they take the knowledge produced by generations of writers for granted, stretch the concept of fair use far beyond how it was originally understood, and resist paying the creators of the content that made their models what they are today.
Theft of human knowledge
A great deal of hard work and sweat goes into the content we see in newspapers, magazines, books, online archives, and research papers, and none of it would exist without the writers, editors, researchers, and publishers who bring it to the public in its various forms.
Such hard-earned knowledge and recognition should not be free for a company to exploit, yet that is exactly what at least one company has done.
“Information that is publicly available on the internet.”
Source: OpenAI.
Yes, that is what OpenAI has to say when asked about the content used to train its AI systems, alongside the information it licensed from third parties and the information its users and human trainers provide.
Speaking of licensed content, companies are seeking it now, but there is no public record of OpenAI licensing content from any vendor before it launched its initial GPT model. That model must have been trained on copyrighted materials that were not free to use for commercial purposes.
Source: Statista.

Compensation for original creators
Until a year ago, most text written online or offline was produced by human effort. Clickbait and low-quality content were mixed in, but even that was created by humans who understood the human psyche and thought process, and generative AI applications were built on the basis of such information.
But today, companies face a new problem in training their AI models: machine-generated content now pervades the internet, and it is not quality content by any means. Such content is plaguing the resources available for training, because models cannot produce quality output when trained on the useless verbiage these models usually churn out. AI feeding on AI output is a process often called AI cannibalism or cloning.
To prevent this, AI firms have to limit their source material to credible sources, which are none other than newspapers, magazines, and public forums that host a wealth of human-produced knowledge. A few more could be counted, as mentioned above, but this necessity, along with lawsuits from newspapers, has forced AI firms to license content and pay for the exploitation they had been carrying out.
Companies like Reddit, a large public web forum, are also considering licensing their content to AI firms. In a statement, Reddit said it would prefer business over lawsuits but did not rule out litigation if business conversations fail. If you are not allowed to put a copyrighted soundtrack on your YouTube video, why should an AI company be allowed to use copyrighted work to train models intended for commercial use?
Copyright ownership is the problem here, and AI firms keep violating it. Moreover, AI is not capable of gathering news on its own: it takes human effort to gather news and confirm it across different sources before publishing, and only then can an AI model use that information. Failing to compensate the humans behind that work is exploitation.