
Written by IOSG Ventures

 

TL;DR

 

Data is as critical to artificial intelligence as gasoline is to cars. In the AI era, data holds enormous value, but today that value is not being captured transparently or responsibly: many large technology companies obtain data without user consent and intercept much of its potential value for themselves.

 

  • The problems facing AI data today include opaque data collection, untraceable provenance, data owners going uncompensated, privacy risks, data that is hard to collect at scale, scarcity of high-quality data, unavailability of specific data, and an insufficient supply of real-time data.

  • Web 3 and crypto technologies aim to strengthen the security of AI data, the interpretability of models, and the oversight of data quality through tokenized incentives, data monetization, privacy protection, and more, ensuring that the economic benefits of data flow to its real owners and that data is used ethically.

  • At the intersection of AI and cryptocurrency, companies are strengthening collaboration through vertical expansion and strategic alliances, which is particularly common in the early stages of industry development. These collaborations are critical to driving widespread adoption of crypto AI solutions.

  • In the future, artificial intelligence and blockchain technology will tend to develop in a "modular" manner. Data solutions driven by blockchain technology will become the key to promoting the development of artificial general intelligence (AGI) to a higher level.

 

1. Data: The fuel of AI

 

Last week, the launch of OpenAI's GPT-4o and Google's Project Astra once again fueled the craze for artificial intelligence. The female-voiced artificial intelligence assistant depicted in the science fiction movie "Her" has almost become a reality!

 

The AI boom of recent years has become an important engine driving innovation across industries. Blockchain is not far behind, as evidenced by the strong performance of AI tokens so far this year: their 98% growth rate ranks fourth among all token categories.

 


 

Recent advances in the field of artificial intelligence have been largely driven by progress in the development of various large language models (LLMs).

 

The performance of large language models (LLMs) is mainly determined by three key factors:

 

  • Model

  • Data

  • Compute

 

Source: IOSG Ventures

 

At the core of artificial intelligence are the underlying models that power it. These models are like cars: there are many brands and types (open source and closed source, for instance), and each has its own strengths. As with cars, some are faster and some offer a smoother ride, but all of them make our daily lives considerably easier.

 

Source: Michael Dempsey

 

Just as the performance of AI models determines the level of AI intelligence, computing intensity and data quality are the key drivers of those models. Continuing the car analogy, computing power is the engine and data is the fuel. Together they form the basic ingredients of AI intelligence and the two major cost drivers for many AI companies. According to a report by LXT, 59% of AI budgets are spent on data. Large data reserves have therefore become a genuine moat for many AI companies.

 

If computing power is the engine of large language models (LLMs), then data is the fuel for these models.

 

With unlimited computing resources, expanding today's largest datasets 100-fold (from 1 trillion to 100 trillion tokens) would greatly reduce a model's prediction error.

 

Source: dynomight.net
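
To make the scaling intuition concrete, here is a minimal sketch assuming a Chinchilla-style power law for loss against dataset size; the constants and exponent below are illustrative assumptions, not fitted values from any published model.

```python
# Illustrative only: a Chinchilla-style power law for loss vs. dataset size.
# The constants E, A, and alpha are assumptions for demonstration, not
# fitted values from any real model.

def data_scaling_loss(tokens: float, E: float = 1.7, A: float = 400.0, alpha: float = 0.28) -> float:
    """Approximate irreducible loss E plus a data-limited term A / D^alpha."""
    return E + A / (tokens ** alpha)

for d in [1e12, 1e13, 1e14]:  # 1T -> 10T -> 100T tokens
    print(f"{d:.0e} tokens -> predicted loss {data_scaling_loss(d):.3f}")
```

Under these assumed constants, the 100x expansion from 1 trillion to 100 trillion tokens cuts the data-limited portion of the loss by roughly a factor of four, which is the shape of the effect the chart above describes.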

 

As AI prediction accuracy improves with more training data, attention is shifting from the quantity of data to its quality. A 2022 analysis projected that the stock of new high-quality text data may run low within the next few years, which makes data quality all the more critical.

 

“What are the main factors hindering the popularization of artificial intelligence? Two problems: shortage of data and talent.” — Andrew Ng, former director of Stanford University’s Artificial Intelligence Laboratory.

 

2. Data bottleneck for AI

 

Source: Towards Data Science, Gadi Singer

 

To build the coveted, powerful large language models (LLMs), data must be fed in at every stage: pre-training, training, fine-tuning, and inference.

 

Currently, LLMs are trained on publicly available data that is converted into tokens (a token is the smallest unit produced when segmenting and encoding input text). This data covers a significant portion of all published books and much of the content of the Internet, hence the name "large language model". With new public information generated every day, the parameter counts of the latest models have grown correspondingly.
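
As a concrete illustration of tokenization, the sketch below uses OpenAI's open-source tiktoken library; the exact vocabulary and token counts vary by model, and the sample sentence is our own.

```python
# Requires: pip install tiktoken (OpenAI's open-source tokenizer library).
# Vocabularies differ by model; cl100k_base is the encoding used by
# GPT-3.5/GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Data is the fuel of AI."
tokens = enc.encode(text)     # text -> list of integer token ids
print(tokens)                 # a handful of ids
print(len(tokens), "tokens")  # roughly one token per short word here
print(enc.decode(tokens))     # ids -> the original text
```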

 

Ironically, many training corpora from public web data are controlled by large AI companies that are quite secretive about their data collection.

 

Large language models like GPT-3 are very vague about their public data sources and collection process. The GPT-3 paper briefly describes Books1 and Books2, two of its training sources, as simply "two internet-based books corpora".

 

Therefore, whether a model is open source or closed source, we have no way to verify the exact sources of its training data; data provenance in AI models is a black box. Users cannot know whether their personal information was collected or whether it is protected. And if something goes wrong with an AI model, the unclear provenance makes it hard to assign responsibility for the problematic data, and hard for users to understand the basis of the model's decisions.

 

This is why AI is dominated by big tech giants, because they control the data generated by their users. Google can see an individual’s search queries, Meta can see what they share, and Amazon can see their purchases. This gives them an omniscient view of user activity within their respective markets.

 

Some tech giants even treat user-generated data as their own property and sell it at a handsome profit, while the creators of that data get nothing. Reddit recently struck a $60 million training data deal with Google. The original data owners have no way to stop this, nor to prevent their private information from leaking. You may wonder: web data is public, so why not crawl it all yourself? In theory you could; the world is awash in data. According to market research firm IDC, 33 zettabytes of data were generated worldwide in 2018, enough to fill 7 trillion DVDs.

 

Unfortunately, to fend off DDoS attacks, websites typically rate-limit large-scale crawling that originates from data centers such as AWS, or deploy defenses such as honeypots. Even if we manage to circumvent a site's security measures and successfully crawl the data, labeling is still an unavoidable next step, and compared with crawling, data labeling is far more labor-intensive and manual.
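
To illustrate the rate limits a crawler runs into, here is a minimal polite-crawler sketch; the URLs, delays, and retry policy are illustrative assumptions, and a production crawler would also honor robots.txt.

```python
# A minimal, polite crawling sketch: a fixed delay between requests and
# exponential backoff on HTTP 429 (the server's "slow down" signal).
# example.com is a placeholder domain.
import time
import requests

def fetch(url: str, retries: int = 3, delay: float = 2.0) -> str | None:
    for attempt in range(retries):
        resp = requests.get(url, headers={"User-Agent": "research-bot/0.1"}, timeout=10)
        if resp.status_code == 429:           # rate limited by the site
            time.sleep(delay * 2 ** attempt)  # exponential backoff, then retry
            continue
        resp.raise_for_status()
        return resp.text
    return None                               # gave up after repeated 429s

for url in ["https://example.com/page1", "https://example.com/page2"]:
    html = fetch(url)
    time.sleep(1.0)  # fixed politeness delay between pages
```

Even a well-behaved crawler like this slows to a trickle at scale when every request comes from the same data-center IP range, which is exactly the bottleneck the text describes.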

 

Despite the availability of non-profit open repositories such as Common Crawl and Web 2 annotation services such as Scale AI, the quality of their data and labels is not always guaranteed, which often produces biased models that replicate stereotypes and distort facts.

 

If real-world data is too hard to obtain, another option is to synthesize it yourself. To fine-tune the visual recognition model for its Go stores, Amazon used graphics software to create virtual shoppers that simulated extreme situations which might arise when customers shop without staff, situations that had not occurred before the stores opened but could occur afterward. Synthetic data for AI training has both advantages and drawbacks. Its main advantage is scalability, as in Amazon Go's simulated unstaffed shopping scenes; another is that synthetic data can be scrubbed of personal information and unintentional bias. The obvious drawback is that synthetic data may lack the complexity and nuance of the real world, so the model may underperform in real scenarios.

 

The timeliness of data also matters. Data is often collected once and may not reflect a changing world. This is a challenge for AI models because they are susceptible to "drift": their accuracy gradually degrades as the way the world works changes. During the COVID-19 pandemic, for example, facial recognition models trained on unobstructed faces struggled once people generally wore masks.

 

To summarize the data bottlenecks of artificial intelligence:

 

  • Opaque data collection

  • Untraceable data provenance in AI models

  • Data owners are not fairly compensated

  • User data privacy is at risk

  • Data is abundant but hard to collect

  • High-quality data is scarce

  • The specific data required may be unavailable

  • Lack of real-time data supply

 

Fortunately, thanks to blockchain, we have a good solution.

 

3. Blockchain empowers AI data

 

Clearly, AI is great at interpreting and reasoning over data; once it has the data, it can get to work. On the blockchain side, token incentives excel at large-scale crowdsourced data collection and resource sharing, while the cryptography built into blockchains has proven powerful for keeping data secure.

 

To break the AI data bottleneck, a wave of crypto data projects has therefore emerged recently. These projects cover data quality assurance, labeling, and encryption; they simplify data collection, maintain data quality, protect data privacy, and enhance the verifiability of AI-generated results.

 

Source: IOSG Ventures

 

3.1 Data Storage

 

As data volumes grow, the structured data required for AI training needs to be stored where it can be retrieved at any time. Decentralized storage networks such as Arweave, Filecoin, and STORJ eliminate the single point of failure of centralized storage. In February of this year, Arweave launched AO, which provides trustless collaborative computing without scale restrictions. AO can store large amounts of data, including AI models, and lets many parallel processes run in compute units that collaborate with other units through open messaging, without relying on a centralized memory space.

 

3.2 Data Infrastructure Toolkit

 

Sahara builds an L1 blockchain for individuals or enterprises to freely and securely deploy personalized autonomous AI. It provides all data-related infrastructure, including community-built knowledge bases, training datasets, data storage, data ownership, and data toolkits (collection, annotation, quality assurance, etc.).

 

3.3 Public Network Data

 

Take the Grass protocol as a prime example. Grass is a web crawling protocol: a network of two million devices that scrapes Internet data in real time and cleans it into a structured, vectorized format for AI companies.

 

To contribute to the network, users simply install a browser extension on their home network device, which then uses their internet bandwidth to scrape data from websites. Currently, users are paid in Grass points, and in the future they will capture revenue in tokens, thereby gaining real value from their data contributions.

 

On the Grass network, users simply contribute spare residential bandwidth through the browser extension and become distributed nodes, enabling large-scale crawling of public web data. Because each node sends its requests over a residential network rather than a centralized data-center network, the crawling traffic is far less likely to trip defenses such as rate limits and honeypots.
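
The sketch below is purely illustrative of the idea of fanning requests across residential nodes; it is not Grass's actual protocol or code, and the node counts and URLs are made up.

```python
# Purely illustrative, NOT Grass's actual implementation: a coordinator
# spreads scraping tasks across many residential nodes so that no single
# IP sends enough traffic to trigger per-IP rate limits.
import random

class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id  # in Grass, a browser extension on a home network

    def scrape(self, url: str) -> dict:
        # A real node would fetch the page over its residential connection.
        return {"node": self.node_id, "url": url, "html": "<html>...</html>"}

nodes = [Node(f"node-{i}") for i in range(1000)]  # Grass claims ~2M devices
urls = [f"https://example.com/page/{i}" for i in range(5000)]

results = [random.choice(nodes).scrape(url) for url in urls]
# Each node handles ~5 requests on average, well below any per-IP rate limit.
```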

 

In addition, Grass nodes do not crawl data behind login walls, thus avoiding legal issues related to accessing private data. All collected data comes from the public Internet, which enhances the legality and privacy of the process. Continuous crawling of network data also means that data can be provided in real time, preventing the "drift" phenomenon in artificial intelligence models.

 

3.4 Industry-specific data

 

Simply scraping public internet data is usually not enough. To further train LLM models that can make good predictions, we need to provide them with more domain-specific data during the training phase. This contextual data usually comes in the form of private data and/or blockchain data.

 

A large amount of private data is generated every day, and it is not easy for large centralized companies to make use of it: Google and Meta, for example, were fined heavily under GDPR for mishandling private data. Yet training only on public data limits the performance of LLMs.

 

Fortunately, token incentives facilitate the democratization of access to high-quality training data.

 

A typical example is Ocean Protocol. It facilitates the exchange and monetization of data between businesses and individuals while ensuring that the data never leaves the provider that stores it. All contributed data is tokenized into datatokens, and providers are rewarded in OCEAN tokens.
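
To illustrate the datatoken concept, here is a toy sketch; it is not Ocean Protocol's actual contract interface, and the class, method names, and dataset URI are hypothetical.

```python
# Illustrative sketch of the datatoken idea, not Ocean Protocol's actual
# contracts: a dataset is wrapped in a fungible access token, and buyers
# spend tokens to obtain access while the raw data stays with the provider.
from dataclasses import dataclass, field

@dataclass
class Datatoken:
    dataset_uri: str                 # pointer to data held by the provider
    balances: dict = field(default_factory=dict)

    def mint(self, to: str, amount: int) -> None:
        self.balances[to] = self.balances.get(to, 0) + amount

    def redeem_access(self, buyer: str) -> str:
        """Burn one token in exchange for an access grant (not the raw data)."""
        if self.balances.get(buyer, 0) < 1:
            raise ValueError("no datatoken balance")
        self.balances[buyer] -= 1
        return f"access-grant:{self.dataset_uri}:{buyer}"

dt = Datatoken("provider://hospital-imaging-set-v1")  # hypothetical dataset
dt.mint("alice", 10)
print(dt.redeem_access("alice"))  # access is granted; the data never moves
```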

 

3.5 Data cleaning and annotation

 

This token-incentivized crowdsourcing logic also applies to data cleaning and labeling, which are extremely labor-intensive tasks in the Web 2 era.

 

“Cognilytica says that in a typical AI project, various data processing tasks take up about 80% of the time. Training machine learning systems requires large numbers of carefully labeled samples, and the labeling is usually done manually.”

 

In the Web 3 era, we can easily outsource these tasks to the public through GameFi-style "X-to-earn" experiences. Projects such as Sapien and PublicAI are actively working on this, and with Grass about to launch its own data annotation service, competition will only intensify.

 

3.6 Blockchain Data

 

To enrich AI models with blockchain-specific data, indexers and decentralized data warehouse solutions like Covalent and Space and Time provide high-quality blockchain data to machine learning developers through unified APIs and SDKs.
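
As a hedged sketch of what such a unified API looks like, the snippet below follows the style of Covalent's REST balances endpoint; the exact path, parameters, and response fields are illustrative and should be checked against the provider's current documentation.

```python
# Hedged sketch of pulling on-chain data through a unified REST API in the
# style of Covalent. Endpoint path and response shape are illustrative.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
address = "0xd8dA6BF26964aF9D7eEd9e03E53415D37aA96045"

url = f"https://api.covalenthq.com/v1/eth-mainnet/address/{address}/balances_v2/"
resp = requests.get(url, params={"key": API_KEY}, timeout=10)
resp.raise_for_status()

# Print each token the address holds: symbol and raw balance.
for item in resp.json()["data"]["items"]:
    print(item["contract_ticker_symbol"], item["balance"])
```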

 

3.7 Data Privacy and Verifiability

 

A major concern during model training and inference is keeping the data involved private. This spans data inputs, the transmission of model weights, and data outputs.

 

Several new cryptographic solutions have emerged to address this challenge, and Bagel provides a nice comparison chart:

 

Source: Bagel Blog

 

Federated Learning (FL) and Fully Homomorphic Encryption (FHE) are both good solutions for protecting data privacy during training.

 

Flock.io is a well-known project dedicated to Federated Learning (FL), a distributed machine learning framework. Privacy is preserved because raw data never leaves the local servers and all computation happens locally. However, recent studies show that FL can still leak data: the global model is not private, since it is shared among the local servers, and the weights and gradients aggregated at each step are shared as well.
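
A minimal FedAvg-style sketch, assuming a linear model and synthetic client data, shows the mechanics: only weights travel to the server, never the raw data, which is also why the shared weights themselves become the leakage surface mentioned above.

```python
# Minimal FedAvg sketch with NumPy: each client takes a local gradient step
# on its private data, and the server averages only the weight vectors.
import numpy as np

def local_sgd_step(w: np.ndarray, X: np.ndarray, y: np.ndarray, lr: float = 0.1) -> np.ndarray:
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    return w - lr * grad

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)]
w_global = np.zeros(3)

for _ in range(10):
    # Raw (X, y) never leaves a client; only updated weights are sent.
    local_weights = [local_sgd_step(w_global.copy(), X, y) for X, y in clients]
    w_global = np.mean(local_weights, axis=0)  # server averages weights only

print("global model weights:", w_global)
```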

 

Fully homomorphic encryption (FHE) allows computations to be performed directly on encrypted data. Since everything stays encrypted, training data and model weights remain private as well, which makes FHE invaluable in fields such as healthcare and finance where data must stay secure during computation. Notable FHE projects include Zama, Bagel, Fhenix, Inco, Sunscreen, and Privasea. FHE's downsides are speed and verifiability, since users must trust that the encrypted computation was performed correctly.
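
Production systems would use libraries such as Zama's TFHE-rs or Concrete ML. As a self-contained illustration of the core idea of computing on ciphertexts, here is a toy Paillier-style scheme with deliberately insecure, tiny keys; note that Paillier is only additively homomorphic, whereas full FHE generalizes to arbitrary computation.

```python
# Toy Paillier-style additively homomorphic encryption: the sum of two
# plaintexts is computed on ciphertexts alone. Keys here are insecurely
# small and for illustration only.
import math
import random

p, q = 293, 433            # toy primes; real deployments use ~2048-bit primes
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)       # modular inverse; valid because g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(42), encrypt(58)
c_sum = (c1 * c2) % n2     # multiplying ciphertexts adds the plaintexts
print(decrypt(c_sum))      # -> 100, computed without ever decrypting inputs
```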

 

ZKML's biggest advantage is that it can verify computational outputs while keeping model weights private, which makes it especially useful for model inference. It generates zero-knowledge proofs that guarantee the correct execution of training or inference, with no trust assumption on the data owner. Projects working on ZKML include Modulus, Giza, and EZKL.

 

It is worth noting that while Federated Learning (FL) and Fully Homomorphic Encryption (FHE) are more commonly used for training and Zero-Knowledge Machine Learning (ZKML) for inference, in practice any of these techniques can be applied to either training or inference.

 

3.8 RAG (Retrieval-Augmented Generation)

 

A dangerous trap at the AI inference stage is "model hallucination": the text generated by a large language model (LLM) is coherent, yet contains wrong or fabricated information that does not match the facts or the user's needs.

 

This phenomenon usually arises because the model was never exposed to the relevant external knowledge during training or fine-tuning. A common fix is to fine-tune the LLM again on contextual data, but that is time-consuming and usually requires retraining. Hence a simpler solution was invented: RAG (Retrieval-Augmented Generation).

 

RAG helps developers because they no longer need to continually retrain their models on new data, which reduces computational cost. RAG lets any AI model (such as an LLM) retrieve relevant information from external knowledge sources, even information absent from its training data, and generate more accurate, context-aware answers, thereby reducing fabricated output.

 

External knowledge is stored as vector embeddings in a vector database. A major advantage of RAG is that users can trace the source of the model's data and verify the accuracy of generated results.
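
A minimal RAG sketch under stated assumptions: the embed() function below is a toy stand-in (a real system uses a learned embedding model that places similar texts near each other), and the documents are invented.

```python
# Minimal RAG sketch with NumPy: embed documents, retrieve the most similar
# one to the query by cosine similarity, then prepend it to the prompt.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a learned embedding model: deterministic random vector.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

docs = [
    "Arweave AO launched in February 2024.",
    "Grass is a web crawling protocol with ~2M nodes.",
    "Ocean Protocol tokenizes datasets into datatokens.",
]
index = np.stack([embed(d) for d in docs])  # the "vector database"

query = "Which protocol turns datasets into datatokens?"
scores = index @ embed(query)               # cosine similarity (unit vectors)
best = docs[int(np.argmax(scores))]         # with a real embedding model,
                                            # the datatoken doc would win
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)
```

Because the retrieved context is an identifiable document, the user can check exactly which source the answer was grounded in, which is the verifiability advantage described above.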

 

Research shows that LLM models using RAG significantly outperform the same model without RAG.

 


 

An innovative Web 3 solution with important applications for RAG is Dria: a vector database running on an Ethereum Layer 2 (with storage on Arweave) that offers a token-incentivized marketplace for external knowledge datasets.

 

Surveying the AI and crypto data stack, it is clear that Web 3 projects increase the value of data in AI in the following ways:

 

  • Data collection

  • Without Web 3: data cannot be collected at scale; teams rely on third-party APIs, pay exorbitant fees, or make do with non-profit datasets.

  • With Web 3: token-incentivized crowdsourcing enables large-scale, globally accessible collection of data for specific needs.

  • Data monetization

  • Without Web 3: users cannot derive value from their data contributions.

  • With Web 3: data is tokenized and monetized, returning its value to the data owner.

  • Privacy enhancement

  • Without Web 3: data sharing during AI model development raises privacy concerns.

  • With Web 3: data owners retain control of their private and personal data, which is not leaked during training, fine-tuning, or inference.

  • Explainable AI

  • Without Web 3: the provenance of datasets and model results cannot be managed or verified.

  • With Web 3: the origin of data is traceable, data is verifiably licensed, and users can deploy models and verify their outputs with confidence.

  • Data quality

  • Without Web 3: the quality of collected data cannot be guaranteed, requiring an in-house verification team or third-party outsourcing at significant operating cost.

  • With Web 3: data is verified through token rewards, and validators who fail to uphold quality standards face token slashing penalties.

 

As Vitalik highlighted in his AI x Crypto article:

 

  • AI provides highly “centralized” intelligence

  • Blockchain provides a high degree of “decentralization” and trustlessness

  • AI x Crypto = Trustlessness + Intelligence, which elevates AI from “will not do evil” to “cannot do evil”

 

4. Trends and Outlook

 

As competition in the AI x Crypto field intensifies, a notable trend is the increasing frequency of cooperation and integration between projects seeking to expand their share of this new market. Some examples follow:

 

4.1 Cooperation between upstream and downstream industries: Kaito runs a subnet on Bittensor

 

Question: How to provide reliable search services in a decentralized environment?

 

Solution: Kaito is a Web3-native AI search platform building an infrastructure layer for the Bittensor ecosystem. In March of this year, Kaito released a subnet called OpenKaito, a decentralized search indexing layer designed for transparent search ranking and scalability. Other subnets can query it for domain-specific information, while miners are incentivized to provide ranked result lists and contribute computing power for data acquisition, indexing, ranking, and knowledge graphs. To prevent forged results, validators check each result's URL against the original source, and miners are rewarded based on the authenticity, relevance, timeliness, and diversity of their results.
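
Purely as an illustration of the verification idea, not OpenKaito's actual validator code: a validator can re-fetch each claimed URL and check the miner's snippet against the live page.

```python
# Illustrative only, NOT OpenKaito's real scoring logic: a validator
# re-fetches each result URL and rejects results whose claimed snippet
# does not appear at the claimed source.
import requests

def verify_result(url: str, claimed_snippet: str) -> bool:
    try:
        page = requests.get(url, timeout=10).text
    except requests.RequestException:
        return False
    return claimed_snippet in page  # forged results fail this check

miner_results = [  # hypothetical miner submission
    {"url": "https://example.com/post/1", "snippet": "decentralized search"},
]
score = sum(verify_result(r["url"], r["snippet"]) for r in miner_results)
print(f"{score}/{len(miner_results)} results verified against their sources")
```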

 

4.2 Collaboration between competitors: Integration of Privasea and Zama’s FHE algorithms

 

Question: How to enhance the privacy and security of AI operations in a blockchain environment?

 

Solution: Privasea and Zama are collaborating to use each other's technology. Under license from Zama, Privasea can now use Zama's TFHE-rs library in its network to enhance the privacy and security of AI operations, and it plans to build blockchain-based private AI applications on Zama's Concrete ML. These tools will be used for tasks such as face recognition, medical image analysis, and financial data processing.

 

4.3 Integration of the entire vertical supply chain: Token merger of SingularityNet, Fetch.AI and Ocean Protocol

 

Question: How to improve the market competitiveness and synergy of the project by merging tokens?

 

Solution: On March 27, 2024, SingularityNet, Fetch.AI, and Ocean Protocol announced a $7.5 billion token merger. The merged Fetch.AI (FET) token will become the ASI token, with a total supply of 2.6 billion. SingularityNet (AGIX) and Ocean (OCEAN) tokens will convert to ASI at a ratio of approximately 0.43 ASI per token. The merged token is named ASI, for Artificial Superintelligence Alliance, and is scheduled to launch officially on May 24.
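
A worked example of the announced conversion, assuming the approximate 0.43 ratio applies per AGIX or OCEAN token; the holding below is hypothetical.

```python
# Worked conversion under the announced terms: AGIX and OCEAN convert to
# ASI at roughly 0.43 ASI per token, while FET redenominates into ASI.
AGIX_TO_ASI = 0.43  # approximate announced ratio

agix_holding = 10_000                      # hypothetical holding
asi_received = agix_holding * AGIX_TO_ASI
print(f"{agix_holding} AGIX -> {asi_received:,.0f} ASI")  # ~4,300 ASI
```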

 

4.4 AI and the Future of Crypto

 

Some believe the AI competitive landscape may eventually settle into familiar territory, a largely duopolistic market like Android and iOS, with one dominant open source model and one dominant closed source model in their respective categories.

 

Regardless of the debate about open source models vs. closed source models, I believe the future of AI will be a world of multi-model reasoning.

 

One concrete implementation of multi-model reasoning happens at the AI agent layer, where the current trend is collaboration between agents. Last week, the Web 3 AI agent protocol ChainML announced a $6.2 million seed extension round to launch its agent base layer, Theoriq. The core idea is to let AI agents dynamically discover and autonomously collaborate with other agents to tackle complex use cases. Theoriq's testnet is scheduled to launch this summer, with more details expected at Consensus 2024.

 

Another implementation of multi-model reasoning is the "Mixture of Experts" (MoE) architecture, which combines a group of smaller, highly specialized expert models that work together to solve the overall problem. GPT-4 is rumored to already use this approach. MoE is highly adaptable and allows modular, personalized configuration.
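
A minimal mixture-of-experts sketch in NumPy, with dimensions and dense (rather than sparse top-k) gating chosen purely for illustration:

```python
# Minimal mixture-of-experts sketch: a gating network assigns soft weights
# to small experts, and the output is their weighted sum. Production MoEs
# route sparsely to the top-k experts inside transformer layers.
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, n_experts = 8, 4, 3

experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]  # expert weights
W_gate = rng.normal(size=(d_in, n_experts))                           # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_gate
    gates = np.exp(logits) / np.exp(logits).sum()  # softmax over experts
    outputs = np.stack([x @ W for W in experts])   # each expert's opinion
    return (gates[:, None] * outputs).sum(axis=0)  # gate-weighted combination

x = rng.normal(size=d_in)
print(moe_forward(x))  # shape (4,): the blended expert output
```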

 

Interestingly, this shift toward AI agents and mixtures of models mirrors what is happening in the blockchain space, where we are moving from monolithic to modular blockchains:

 

Monolithic blockchain -> Modular blockchain

Single AI agent -> Modular, composable AI agent base layer

Single large language model -> Mixture of expert models

 

In the chain-of-thought (CoT) style pipelines these mixture-of-experts (MoE) models run, the output of one expert model becomes the input of the next.

 

The errors of one model can be mitigated by the strengths of another, leading to more reliable results; however, errors can also be amplified along this chain of reasoning.

 

This poses a threat: after all, large language models (LLMs) are a double-edged sword that can be used for good or ill.

 

OpenAI's SSL certificate logs reveal development of "search.chatgpt.com" and a potential search product launch, suggesting that more LLM projects may release their own search engines to compete with established platforms such as Google and Perplexity.

 

Given that more and more people believe everything LLMs say without question, malicious actors have every incentive to pollute LLM outputs by feeding false knowledge into models as training data. Introducing as little as 1% or 2% bias into the training data may be enough for the model chain to propagate those biases and significantly poison the results.

 

This becomes truly frightening if malicious actors can influence human decision-making by contaminating the data fed into LLMs, especially around major events such as the upcoming presidential election; such manipulation could even skew voting outcomes if voters are exposed to false or fabricated information spread by LLMs.

 

The impact of misinformation and polarized political views spread on Twitter, and the criticism that followed, was already evident in the 2016 and 2020 elections!

 

Fortunately, as we approach the world of artificial general intelligence (AGI), Web 3 and blockchain technology offer a powerful remedy for ensuring data integrity, quality, and privacy.

 

The future of AI looks very bright, and we look forward to seeing how innovations in the crypto data space continue to empower AI.