Author: Li Jin, Partner at Variant Fund; Translation: Jinse Finance xiaozou

Recent high-profile data licensing deals, such as OpenAI's agreements with News Corp and Reddit, have highlighted artificial intelligence's (AI's) demand for high-quality data. Cutting-edge large models have already been trained on much of the internet: Common Crawl, for example, indexes about 10% of the web for LLM training and contains more than 100 trillion tokens.

One way to further improve AI models is to expand and enhance the data they can train on. We have been thinking about mechanisms for aggregating data, especially in decentralized ways. In particular, we are interested in how decentralized approaches can help generate new datasets while providing economic rewards to contributors and creators.

One of the most discussed ideas in the crypto space in recent years is the data DAO: a group of people who together create, curate, and manage data. Multicoin and others have written about the concept, but the rapid development of artificial intelligence gives new force to the long-standing question about data DAOs: "Why now?"

In this article, we will share our thoughts on Data DAOs in order to answer the question: How can Data DAOs accelerate the development of AI?

1. The state of data in AI

Today, AI models are trained on public data, either through partnerships such as those with News Corp and Reddit, or by scraping the open internet. Meta's Llama 3, for example, was trained on 15 trillion tokens from public sources. These methods are effective at aggregating large amounts of data quickly, but they limit both what kind of data can be collected and how it is collected.

First, what data is collected: AI progress is bottlenecked by the quality and quantity of available data. Leopold Aschenbrenner has written about a "data wall" that limits further model improvement: "Soon, the naive approach of pre-training larger language models on more scraped data may start to encounter serious bottlenecks."

One way to knock down the data wall is to open up new datasets. For example, model companies cannot scrape data behind logins without violating most websites' terms of service, and by definition they cannot access data that has never been collected. There is also a large amount of private data that AI training currently cannot reach: files held in services such as Google Drive and Slack, personal health records, and private messages.

Second, how the data is collected: under the existing model, the companies that collect data capture most of the value. Reddit's S-1 lists data licensing as a major expected revenue source: "We expect our growing data advantage and intellectual property will continue to be a key element of future LLM training." The end users who generate the actual content receive no financial benefit from these licensing agreements or from the AI models themselves. This misalignment can stifle participation: there are already movements to sue generative AI companies and to opt out of training datasets. That is to say nothing of the socioeconomic impact of concentrating revenue in the hands of model companies and platforms without giving a cent to end users.

2. Data DAO Effect

The data problems described above share a common thread: they benefit from a large number of contributions from a diverse, representative sample of users. The value of any single data point to model performance may be negligible, but collectively, a large group of users can aggregate new datasets that are valuable for AI training. This is where the concept of a Data DAO comes in. With a Data DAO, data contributors can reap the economic benefits of providing data and manage how the data is used and monetized.

Where can Data DAOs contribute to the current data landscape? Here are some ideas — note that this is not an exhaustive list and there are certainly other opportunities for Data DAOs:

(1) Real-world data

In the field of decentralized physical infrastructure (DePIN), networks such as Hivemapper aim to collect up-to-date global map data by incentivizing dashcam owners to contribute footage and app users to contribute observations (e.g., about road closures or repairs). DePIN networks can be thought of as real-world data DAOs: datasets are generated by networks of hardware devices and/or users, the data has commercial value to many companies, and revenue flows back to contributors in the form of token rewards.

(2) Personal health information

Biohacking is a social movement in which individuals and communities take a DIY approach to studying biology, often by experimenting on themselves. For example, a person might take different nootropics to improve brain performance, or test different treatments or environmental changes to improve sleep, or even inject themselves with experimental drugs.

Data DAOs could bring incentives to these biohacking efforts by organizing participants around common experiments and systematically collecting the results. Revenue earned by these personal health DAOs, for example from research labs or pharmaceutical companies, could flow back to the participants who contributed their personal health data.

(3) Reinforcement learning from human feedback

Fine-tuning AI models using RLHF (reinforcement learning from human feedback) leverages human input to improve an AI system's performance. Increasingly, that feedback needs to come from domain experts who can effectively evaluate model output; a lab might, for example, enlist mathematics PhDs to improve an LLM's mathematical abilities. Token rewards, with their speculative upside, can help find and incentivize expert participation, and crypto payment rails make that participation globally accessible. Companies such as Sapien, Fraction, and Sahara are all working in this space.

(4) Private data

As the public data available for AI training becomes scarcer, the basis of competition may shift to proprietary datasets, including private user data. A large amount of high-quality data remains inaccessible behind login walls: private messages, private files, and more. This data could be used to train personal AI, and it also contains valuable information that is not available on the public web.

However, accessing and leveraging this data presents significant legal and ethical challenges. Data DAOs offer one solution: willing participants upload their data, monetize it, and govern how it is used. For example, a Reddit data DAO could let users upload the data they export from the Reddit platform (comments, posts, and voting history), which could then be sold or rented to AI companies in a privacy-preserving manner. Token incentives would let users earn not only through one-time transactions, but on an ongoing basis tied to the value created by AI models trained on their data.

3. Open issues and challenges

While the potential benefits of Data DAOs are enormous, there are also some considerations and challenges.

(1) Distorted Influence of Incentives

One lesson from the history of token incentives in crypto is that external incentives change user behavior. This bears directly on using token incentives for data collection: incentives may distort both who participates and what kind of data they contribute.

Token incentives also invite participants to game the system, for example by submitting low-quality or fabricated data to maximize their income. This matters because the revenue opportunity of a data DAO depends on the quality of its data; contributions that deviate from the target undermine the dataset's value.

(2) Data measurement and rewards

The core idea of a Data DAO is to reward contributors for their data submissions with token incentives, funded in the long run by the revenue the DAO earns. However, given the subjective nature of data value, knowing exactly how much to reward each contribution is extremely challenging. Take the biohacking example above: is some users' data more valuable than others'? If so, what are the determining factors? For map data: is map information in some regions more valuable than in others, and how can the difference be quantified? (There is active research on measuring the value of data in AI by calculating each data point's incremental contribution to model performance, but this approach can be computationally intensive.)
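The incremental-contribution idea mentioned in the parenthetical can be sketched as leave-one-out valuation: reward each contributor in proportion to how much a performance metric drops when their data is removed. This is a minimal illustration, not a production scheme; the `evaluate` function below is a hypothetical stand-in for "train on this data and score on a held-out set," which is what makes the real version computationally expensive.

```python
# Leave-one-out data valuation sketch. `evaluate` is a placeholder metric:
# here, the number of unique non-empty data points. A real system would
# retrain (or approximate retraining of) the model for each evaluation.

def evaluate(dataset):
    # Placeholder: unique, non-empty data points improve the score.
    return len(set(d for d in dataset if d))

def leave_one_out_values(contributions):
    """contributions: dict mapping contributor -> list of data points."""
    full = [d for points in contributions.values() for d in points]
    baseline = evaluate(full)
    values = {}
    for who in contributions:
        # Re-evaluate the dataset with this contributor's data removed.
        without = [d for other, pts in contributions.items()
                   if other != who for d in pts]
        values[who] = baseline - evaluate(without)
    return values

contributions = {
    "alice": ["road_closure_5th_ave", "pothole_main_st"],
    "bob": ["pothole_main_st"],          # duplicate: adds nothing new
    "carol": ["new_bridge_route_9"],
}
print(leave_one_out_values(contributions))
# {'alice': 1, 'bob': 0, 'carol': 1}
```

Note how the duplicate submission earns nothing: this is one reason valuation and anti-gaming mechanisms are intertwined.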

Additionally, it is critical to establish robust mechanisms to verify the authenticity and accuracy of data. Without them, the system is vulnerable to fraudulent data submissions (such as those from fake accounts) and Sybil attacks. DePIN networks attempt to address this by verifying at the hardware-device level, but other types of data DAOs that rely on user contributions may be more vulnerable to manipulation.
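Two of the simplest software-level defenses can be sketched as follows, under the assumption that submissions arrive as (account, payload) pairs: exact-duplicate rejection via content hashing, and a per-account submission cap to blunt Sybil-style farming. This is only a first line of defense; real systems would layer on hardware attestation, stake-based penalties, or statistical anomaly detection.

```python
import hashlib

class SubmissionFilter:
    """Rejects exact-duplicate payloads and caps submissions per account."""

    def __init__(self, per_account_cap=3):
        self.seen_hashes = set()
        self.counts = {}
        self.cap = per_account_cap

    def accept(self, account, payload):
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in self.seen_hashes:
            return False  # exact duplicate: likely copied or replayed data
        if self.counts.get(account, 0) >= self.cap:
            return False  # over the cap: throttle suspected farming
        self.seen_hashes.add(digest)
        self.counts[account] = self.counts.get(account, 0) + 1
        return True

f = SubmissionFilter()
print(f.accept("account_1", "sensor-reading-1"))  # True
print(f.accept("account_2", "sensor-reading-1"))  # False: duplicate content
```

Note that content hashing only catches verbatim copies; near-duplicates and plausible fabrications require the data-quality measures discussed above.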

(3) Incrementality of new data

Most of the open web is already used for training, so data DAO operators must consider whether datasets collected in a distributed manner are truly incremental to what already exists on the open web, and whether researchers could obtain the same data from a platform or through other means. The ideas above underscore the importance of collecting genuinely new data, which leads to the next consideration: the size of the impact and the revenue opportunity.

(4) Evaluate revenue opportunities

Essentially, Data DAOs are building a two-sided market that connects data buyers and data contributors. Therefore, the success of a Data DAO depends on attracting a stable and diverse base of customers who are willing to pay for data.

Data DAOs need to identify and validate end demand and ensure that the revenue opportunity is large enough (in aggregate and per contributor) to incentivize the required quantity and quality of data. For example, the idea of a user data DAO that aggregates personal preference and browsing data for advertising has been discussed for years, but the revenue such a network could pass on to users would likely be minimal. (For reference, Meta's global ARPU at the end of 2023 was $13.12.) With AI companies planning to invest trillions of dollars in training, however, the data revenue distributed to each user may become large enough to attract contributions at scale, which points to one answer to the data DAO question of "Why now?"
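A back-of-envelope calculation makes the per-contributor comparison concrete. The figures below are purely illustrative assumptions (a hypothetical $100M/year licensing deal, 1M contributors, and a 10% DAO take rate), not numbers from any actual data DAO; only the $13.12 Meta ARPU figure comes from the text above.

```python
# Illustrative per-contributor payout estimate for a data DAO.
# All inputs are hypothetical; compare the result with Meta's reported
# global ARPU of $13.12 at the end of 2023.

def per_contributor_payout(total_licensing_revenue, contributors,
                           dao_take_rate=0.1):
    """Annual revenue passed to each contributor after the DAO's cut."""
    return total_licensing_revenue * (1 - dao_take_rate) / contributors

payout = per_contributor_payout(100_000_000, 1_000_000)
print(f"${payout:.2f} per contributor per year")  # $90.00 per contributor per year
```

Under these assumptions the payout comfortably exceeds ad-based ARPU, but the result is highly sensitive to the contributor count: at 10M contributors the same deal yields only $9 per person.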

4. Overcoming the data wall

Data DAOs represent a potentially promising way to generate new, high-quality datasets and break down the data wall in AI. How this will pan out remains to be seen, but we’re excited to see how this space develops.