Written by: Jiang Jiang

Editor: Manman Zhou

The emergence of ChatGPT and the explosive adoption of Midjourney gave AI its first large-scale application: the popularization of large models.

A so-called large model is a machine learning model with a vast number of parameters and a complex structure, capable of processing massive amounts of data and completing a wide range of complex tasks.

01 AI data copyright disputes

If we compare today's AI models to cars, then raw data is the crude oil. Whatever else happens, the models need enough "crude oil" first.

AI companies draw their "crude oil" mainly from the following sources:

  • Public, free data sources on the Internet, such as Wikipedia, blogs, forums, and news sites;

  • Established news media and publishing houses;

  • Universities and other research institutions;

  • End users (consumers) interacting with the model.

In the real world, oil ownership is governed by mature legal frameworks; in the still-chaotic field of AI, the right to extract "crude oil" remains unsettled, and the disputes it has spawned are countless.

Just recently, several major music labels sued the AI music generation companies Suno and Udio for copyright infringement, a lawsuit similar to the one The New York Times filed against OpenAI last December.

Image source: Billboard

In July 2023, a group of writers sued OpenAI, alleging that ChatGPT generated summaries of their works based on copyrighted content.

In December of the same year, The New York Times also filed a similar copyright infringement lawsuit against Microsoft and OpenAI, accusing the two companies of using the newspaper's content to train artificial intelligence chatbots.

In addition, a class-action lawsuit was filed in California, accusing OpenAI of obtaining users' private information from the internet to train ChatGPT without their consent.

OpenAI ultimately did not pay over the accusations. It said it disagreed with The New York Times' claims and could not reproduce the problems the newspaper described. More importantly, OpenAI argued that the Times' content was not an important data source for its models.

Source: https://openai.com/index/openai-and-journalism/

For OpenAI, the biggest lesson from the episode may be the need to manage its relationships with data suppliers and clarify each party's rights and responsibilities. Accordingly, over the past year OpenAI has reached partnerships with many data suppliers, including but not limited to The Atlantic, Vox Media, News Corp, Reddit, the Financial Times, Le Monde, Prisa Media, Axel Springer, and the American Journalism Project.

Going forward, OpenAI can legitimately use these outlets' data, and the outlets will in turn integrate OpenAI's technology into their products.

02 AI drives content platform monetization

However, the fundamental reason OpenAI strikes deals with data suppliers is not fear of lawsuits but the looming data exhaustion facing machine learning. A study by researchers, including some from MIT, estimated that machine learning datasets may exhaust all "high-quality language data" by 2026.

"High-quality data" has therefore become a hot commodity for model makers such as OpenAI and Google. Content companies have rushed to strike deals with AI model makers, opening up a nearly effortless revenue stream.

The stock-image platform Shutterstock has reached deals with AI companies including Meta, Alphabet, Amazon, Apple, OpenAI, and Reka. In 2023, licensing content to AI models brought its annual revenue from that business to $104 million, and the figure is expected to reach $250 million in 2027. Reddit earns as much as $60 million a year licensing its content to Google. Apple is also courting mainstream news media, offering copyright fees of at least $50 million a year. The copyright fees content companies collect from AI companies are surging at an annual growth rate of 450%.

Image source: CX Scoop

For the past few years, monetizing content outside streaming media has been a major pain point for the content industry. Compared with the Internet startup era, AI has brought the industry far greater imagination and stronger revenue expectations.

03 High-quality data is still scarce

Of course, not all content meets the needs of AI.

Another focal point of the aforementioned dispute between OpenAI and The New York Times is data quality. To refine oil from crude, the crude itself must be good, and so must the refining technology.

OpenAI specifically emphasized that The New York Times' content made no significant contribution to its model training. Unlike Shutterstock, to which OpenAI pays tens of millions of dollars a year, text media that trade on timeliness, like the Times, are not the darlings of the AI era. AI wants deeper, more distinctive data.

But high-quality data is scarce, so AI companies have begun building their own "refining technology" and one-stop applications.

On June 25, OpenAI acquired Rockset, a real-time analytics database company whose core offering is real-time data indexing and querying. OpenAI will integrate Rockset's technology into its products to make data more valuable in real time.

Image source: DePIN Scan

Through the acquisition, OpenAI aims to let its AI access and use real-time data more effectively. This would allow OpenAI's products to support more complex applications, such as real-time recommendation systems, dynamic data-driven chatbots, and real-time monitoring and alerting.
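The "dynamic data-driven chatbot" pattern mentioned above can be sketched in a few lines. The toy `RealTimeIndex` class below is purely illustrative, a stand-in for what a real-time database like Rockset provides (Rockset's actual interface is SQL over converged indexes); all names here are assumptions, not OpenAI's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RealTimeIndex:
    """Toy stand-in for a real-time indexing/query layer (illustrative only)."""
    docs: list = field(default_factory=list)

    def ingest(self, text: str) -> None:
        # The key property of a real-time index: new data is queryable
        # immediately after ingestion, not after a periodic batch rebuild.
        self.docs.append({"text": text, "seq": len(self.docs)})

    def query(self, keyword: str, limit: int = 3) -> list:
        # Naive keyword match, newest first; a real system would use
        # proper indexes and a query language.
        hits = [d for d in self.docs if keyword.lower() in d["text"].lower()]
        return sorted(hits, key=lambda d: d["seq"], reverse=True)[:limit]

def answer(index: RealTimeIndex, keyword: str) -> str:
    """Ground a (hypothetical) chatbot reply in freshly ingested data."""
    context = index.query(keyword)
    if not context:
        return "No fresh data available."
    return f"Based on the latest data: {context[0]['text']}"

index = RealTimeIndex()
index.ingest("BTC traded at 61,200 USD at 14:02 UTC")
print(answer(index, "BTC"))
```

The point of the sketch is the ingest-then-immediately-query loop: a model grounded this way can answer from data that arrived seconds ago, which batch-trained weights alone cannot do.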

Rockset is, in effect, OpenAI's in-house "refinery," converting ordinary data directly into the high-quality data its applications require.

04 Is it a fantasy to confirm the ownership of creators’ data?

The data on Internet media platforms (Facebook, Reddit, etc.) largely comes from UGC, that is, user-generated content. While charging AI companies steep data fees, many platforms have also quietly added a clause to their user terms stating that "the platform has the right to use user data to train AI models."

Even with that clause on the books, most creators have no idea which models are using their content, whether anyone is being paid for it, or how to claim the rights that should be theirs.

During Meta's quarterly earnings call in February, Zuckerberg made it clear that Meta would use images from Facebook and Instagram to train its AI generation tools.

Tumblr has also reportedly reached undisclosed content licensing agreements with OpenAI and Midjourney, without revealing the terms.

Creators on the stock-photo platform EyeEm likewise recently received notice that their uploaded photos would be used to train AI models. The notice said users who objected could stop using the product, but mentioned no compensation. EyeEm's parent company Freepik told Reuters that it has signed agreements with two large technology companies to license most of its 200 million images at roughly 3 cents per image. CEO Joaquin Cuenca Abela said five similar deals are in progress, but declined to name the buyers.

UGC-heavy content platforms such as Getty Images, Adobe, Photobucket, Flickr, and Reddit face the same problem. Under the huge temptation of data monetization, platforms choose to ignore users' ownership of their content and sell the data wholesale to AI model companies.

The whole process happens behind closed doors, leaving creators no chance to object. Many may never even suspect that their earlier works were sold for model training, until one day a model produces content that looks suspiciously like their own.

Web3 may offer a way to solve creators' data-rights confirmation and income protection. As AI companies hit new highs on the US stock market, web3 AI concept tokens soared in tandem. Blockchain, decentralized and tamper-proof, has a unique advantage in protecting creators' rights.

Media content such as images and video was already widely brought on-chain during the 2021 bull market, and on-chain adoption of UGC from social platforms is quietly under way. Meanwhile, many web3 AI model platforms already reward the ordinary users who contribute to model training, both data owners and trainers.
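The "tamper-proof" property that makes blockchains useful for rights confirmation can be shown with a minimal sketch: hash a work to a fingerprint, then register it in an append-only ledger where each record commits to the previous one, so rewriting any past entry invalidates everything after it. This is an illustrative toy, not any particular chain's protocol; every class and field name is an assumption for demonstration.

```python
import hashlib
import json

def content_fingerprint(data: bytes) -> str:
    """Hash a creative work so ownership can be asserted
    without publishing the work itself."""
    return hashlib.sha256(data).hexdigest()

class ProvenanceLedger:
    """Minimal append-only ledger (toy, not a real blockchain): each
    record commits to the previous record's hash, making the history
    tamper-evident."""
    def __init__(self):
        self.records = []
        self.head = "0" * 64  # genesis value before any record exists

    def register(self, owner: str, fingerprint: str) -> dict:
        record = {"owner": owner, "work": fingerprint, "prev": self.head}
        # The record's id is the hash of its own contents, which include
        # the previous head -- this is the chaining step.
        self.head = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        record["id"] = self.head
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every hash; any edit to a past record breaks the chain."""
        prev = "0" * 64
        for r in self.records:
            if r["prev"] != prev:
                return False
            body = {k: r[k] for k in ("owner", "work", "prev")}
            prev = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if r["id"] != prev:
                return False
        return True

ledger = ProvenanceLedger()
ledger.register("alice", content_fingerprint(b"my original artwork bytes"))
print(ledger.verify())
```

A creator (or a platform acting on their behalf) could register a fingerprint at upload time; an AI company licensing the data could then be checked against the ledger, which is the rights-confirmation step the passage above argues for.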

The exponential growth of AI models has made data-rights confirmation ever more pressing. Creators should be asking: why was my work sold to an AI model company for five cents apiece without my consent? Why did the whole process happen without my knowledge, and without my seeing a cent?

Even if media platforms exhaust every resource they have, they cannot relieve AI model companies' data anxiety. The prerequisite for sustained high-quality data output is data-rights confirmation and a reasonable distribution of interests among creators, platforms, and AI model companies.