1 Introduction
From the first wave of dApps in 2017 (Etheroll, ETHLend, and CryptoKitties) to today's flourishing of financial, gaming, and social dApps across different blockchains, have we ever stopped to ask where the data these decentralized on-chain applications consume in their interactions actually comes from?
In 2024, attention has turned to AI and Web3. In the world of artificial intelligence, data is the lifeblood of growth and evolution. Just as plants rely on sunlight and water to thrive, AI systems rely on massive amounts of data to continuously "learn" and "think". Without data, even the most sophisticated AI algorithm is a castle in the air, unable to deliver its intended intelligence and efficiency.
From the perspective of blockchain data accessibility, this article analyzes the evolution of blockchain data indexing over the course of the industry's development, and compares the established indexing protocol The Graph with the emerging blockchain data service protocols Chainbase and Space and Time, focusing in particular on how these two newer protocols, both of which incorporate AI technology, compare in their data services and product architecture.
2 The complexity and simplicity of data indexing: from blockchain nodes to full-chain databases
2.1 Data source: blockchain node
From the moment we first learn "what a blockchain is", we encounter the sentence: a blockchain is a decentralized ledger. Blockchain nodes are the foundation of the entire network, responsible for recording, storing, and propagating all on-chain transaction data. Each node holds a complete copy of the blockchain data, preserving the network's decentralized nature. However, for ordinary users, building and maintaining a node is no easy task: it requires specialized technical skills and comes with high hardware and bandwidth costs. At the same time, ordinary nodes offer limited query capabilities and cannot return data in the formats developers need. So although in theory anyone can run their own node, in practice users usually prefer to rely on third-party services.
To solve this problem, RPC (Remote Procedure Call) node providers came into being. These providers are responsible for the cost and management of the nodes and provide data through RPC endpoints. This allows users to easily access blockchain data without building their own nodes. Public RPC endpoints are free, but have rate limits, which may have a negative impact on the user experience of dApps. Private RPC endpoints provide better performance by reducing congestion, but even simple data retrieval requires a lot of back-and-forth communication. This makes them request-heavy and inefficient for complex data queries. In addition, private RPC endpoints are often difficult to scale and lack compatibility across different networks. However, the standardized API interface of the node provider gives users a lower threshold to access data on the chain, laying the foundation for subsequent data parsing and application.
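To make the "back-and-forth communication" concrete, the sketch below shows what a single raw RPC exchange looks like: a standard Ethereum JSON-RPC request for one block, and the hex decoding that even the simplest response field requires. The block number is illustrative.

```python
import json

# A standard Ethereum JSON-RPC request for a block, as sent to any RPC
# endpoint (public or private). Every additional piece of data a dApp
# needs means another round trip like this one.
request = {
    "jsonrpc": "2.0",
    "method": "eth_getBlockByNumber",
    "params": ["0x10d4f", False],  # block number as hex; False = tx hashes only
    "id": 1,
}
payload = json.dumps(request)

# Responses encode numbers as hex "quantity" strings, so even a block
# number must be decoded before it is usable.
def decode_quantity(hex_str: str) -> int:
    return int(hex_str, 16)

print(decode_quantity("0x10d4f"))  # 68943
```

Multiply this by every transaction, log, and balance a dApp needs, and the inefficiency of complex queries over plain RPC becomes clear.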
2.2 Data parsing: from raw data to usable data
The data obtained from blockchain nodes is typically raw data that has been encoded and serialized. While this preserves the integrity and security of the blockchain, its complexity also makes it harder to analyze. For ordinary users or developers, processing this raw data directly requires substantial technical knowledge and computing resources.
This is where data parsing becomes critical. By parsing complex raw data and converting it into formats that are easier to understand and work with, users can make sense of the data far more intuitively. The quality of data parsing directly determines the efficiency and effectiveness of blockchain data applications, making it a key step in the entire data indexing process.
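A small example of what such parsing looks like in practice: decoding an ERC-20 Transfer event log, where addresses hide inside zero-padded 32-byte topics and the amount is a hex blob. The addresses and amount below are made up for illustration.

```python
# A raw ERC-20 Transfer event log as a node returns it. topics[0] is the
# event signature hash; topics[1]/[2] are the zero-padded from/to
# addresses; data holds the amount. Values here are illustrative.
raw_log = {
    "topics": [
        "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
        "0x000000000000000000000000a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0",
        "0x0000000000000000000000001111111111111111111111111111111111111111",
    ],
    "data": "0x0000000000000000000000000000000000000000000000000de0b6b3a7640000",
}

def parse_transfer(log: dict) -> dict:
    """Convert a raw Transfer log into human-readable fields."""
    return {
        "from": "0x" + log["topics"][1][-40:],  # strip 12 bytes of padding
        "to": "0x" + log["topics"][2][-40:],
        "value": int(log["data"], 16),          # raw integer, token-decimals scale
    }

parsed = parse_transfer(raw_log)
print(parsed["value"])  # 1000000000000000000, i.e. 1 token at 18 decimals
```

Even this one event type needs format-specific knowledge (topic layout, padding, decimals); multiplied across thousands of contracts, the case for dedicated parsing infrastructure is evident.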
2.3 Evolution of Data Indexers
As the volume of blockchain data grows, so does the need for data indexers. Indexers play a vital role in organizing on-chain data and loading it into databases for easy querying. They work by indexing blockchain data and exposing it through query languages such as GraphQL. By providing a unified interface for querying data, indexers let developers quickly and accurately retrieve the information they need with a standardized query language, greatly simplifying the process.
Different types of indexers optimize data retrieval in various ways:
Full-node indexers: run full blockchain nodes and extract data directly from them, ensuring completeness and accuracy but requiring substantial storage and processing power.
Lightweight indexers: rely on full nodes to fetch specific data on demand, reducing storage requirements but potentially increasing query times.
Specialized indexers: focus on certain data types or specific blockchains, optimizing retrieval for particular use cases such as NFT data or DeFi transactions.
Aggregate indexers: pull data from multiple blockchains and sources, including off-chain information, and offer a unified query interface, which is particularly useful for multi-chain dApps.
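The core advantage all of these share over raw node access can be shown with a toy in-memory indexer: pay the indexing cost once at ingestion, then answer queries by lookup instead of by scanning the chain. All data below is invented for the example.

```python
from collections import defaultdict

# Toy illustration of what an indexer adds: transfers are indexed by
# address once, so later queries are dictionary lookups rather than
# full scans over raw blocks. Data is made up for the example.
transfers = [
    {"block": 1, "from": "0xaaa", "to": "0xbbb", "value": 5},
    {"block": 2, "from": "0xbbb", "to": "0xccc", "value": 3},
    {"block": 3, "from": "0xaaa", "to": "0xccc", "value": 7},
]

index = defaultdict(list)
for t in transfers:                # one pass at ingestion time...
    index[t["from"]].append(t)
    index[t["to"]].append(t)

def transfers_of(address: str) -> list:
    return index[address]          # ...then cheap lookups at query time

print(len(transfers_of("0xaaa")))  # 2
```

Real indexers layer query languages, persistence, and distribution on top of this same idea.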
Currently, the archive mode of an Ethereum archive node under the Geth client takes up about 13.5 TB of storage, while under the Erigon client the requirement is about 3 TB. As the blockchain continues to grow, archive nodes' storage requirements will only increase. Faced with such volumes, mainstream indexer protocols not only support multi-chain indexing but also tailor their data parsing frameworks to the data requirements of different applications; The Graph's "Subgraph" framework is a typical example.
The emergence of indexers has greatly improved the efficiency of data indexing and querying. Compared with traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. These indexers allow users to perform complex queries, easily filter data, and analyze it after extraction. In addition, some indexers also support aggregating data sources from multiple blockchains, avoiding the problem of deploying multiple APIs in multi-chain dApps. By running distributedly on multiple nodes, indexers not only provide stronger security and performance, but also reduce the risk of interruptions and downtime that may be caused by centralized RPC providers.
In contrast to raw RPC access, indexers use predefined query languages that let users obtain the information they need directly, without processing the underlying complexity. This mechanism significantly improves the efficiency and reliability of data retrieval and represents an important innovation in blockchain data access.
2.4 Full-chain databases: the stream-first approach
Querying data through index nodes usually means the API becomes the sole gateway for consuming on-chain data. But when a project enters its scaling phase, it often needs more flexible data sources than standardized APIs can provide. As application requirements grow more complex, first-generation data indexers and their standardized index formats struggle to satisfy increasingly diverse query needs such as search, cross-chain access, or off-chain data mapping.
In modern data pipeline architectures, the "stream-first" approach has emerged as a solution to the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift enables organizations to respond immediately to incoming data, thereby gaining insights and making decisions almost instantly. Similarly, the development of blockchain data service providers is also moving towards building blockchain data streams. Traditional indexer service providers have successively launched products that obtain real-time blockchain data in the form of data streams, such as The Graph's Substreams, Goldsky's Mirror, and real-time data lakes that generate data streams based on blockchains, such as Chainbase and SubSquid.
These services are designed to address the need for real-time analysis of blockchain transactions and more comprehensive query capabilities. Just as the "stream-first" architecture has revolutionized the way data is processed and consumed in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream service providers also hope to support the development of more applications and assist in on-chain data analysis through more advanced and mature data sources.
By reframing the challenges of on-chain data through the lens of a modern data pipeline, we can see the full potential of managing, storing, and serving on-chain data from a whole new perspective. When we start thinking of indexers like subgraphs and Ethereum ETL as data flows in a data pipeline rather than final outputs, we can imagine a possible world where high-performance datasets can be tailored for any business use case.
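The shift from batch to stream can be sketched in a few lines: instead of re-querying a batch API, consumers react to each block as it arrives and keep state incrementally up to date. The block feed here is simulated; a real stream would come from a service such as Substreams or a data-lake subscription.

```python
# Minimal sketch of the "stream-first" idea: state is updated per block
# as it arrives, so insights are available immediately rather than after
# an end-of-batch job. The feed below is simulated for illustration.
def block_stream():
    for n in range(1, 4):
        yield {"number": n, "tx_count": n * 10}

running_total = 0
for block in block_stream():
    # Incremental processing: each block updates the running figure,
    # with no wait for a batch window to close.
    running_total += block["tx_count"]
    print(f"block {block['number']}: cumulative txs = {running_total}")
```

The same consumer loop could feed a dashboard, an alerting rule, or a downstream dataset, which is exactly the flexibility the stream-first framing promises.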
3 AI + Database? An in-depth comparison of The Graph, Chainbase, and Space and Time
3.1 The Graph
The Graph network delivers multi-chain data indexing and query services through a decentralized network of nodes, making it easy for developers to index blockchain data and build decentralized applications. Its main product model consists of two markets, both of which ultimately serve users' query needs: in the data query execution market, consumers pay an appropriate index node for the data they need; the data index cache market is where index nodes allocate resources based on a subgraph's historical indexing popularity, the query fees it collects, and on-chain curators' demand for its output.
Subgraphs are the basic data structures in The Graph network. They define how to extract and transform data from the blockchain into a queryable format (such as a GraphQL schema). Anyone can create a subgraph, and multiple applications can reuse these subgraphs, which improves data reusability and efficiency.
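As an illustration, a subgraph's schema is written in GraphQL; the fragment below is a hypothetical entity definition of the kind a transfer-tracking subgraph might declare (the entity and field names are invented for the example).

```graphql
# Illustrative entity from a hypothetical subgraph schema: each Transfer
# event on a tracked contract becomes one queryable entity.
type Transfer @entity {
  id: ID!
  from: Bytes!
  to: Bytes!
  value: BigInt!
  blockNumber: BigInt!
}
```

Any application can then query these entities over GraphQL without caring how the underlying events were extracted, which is what makes subgraphs reusable across dApps.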
The Graph Product Structure (Source: The Graph Whitepaper)
The Graph Network consists of four key roles: Indexers, Curators, Delegators, and Developers, who together provide data support for web3 applications. Here are their respective responsibilities:
Indexer: Indexers are node operators in The Graph network. They participate by staking GRT (The Graph's native token) and provide indexing and query processing services.
Delegator: Delegators are users who stake GRT tokens to index nodes to support their operations. Delegators earn part of the rewards through the index nodes they delegate.
Curator: Curators are responsible for signaling which subgraphs should be indexed by the network. Curators help ensure that valuable subgraphs are prioritized.
Developer: Unlike the three supplier roles above, developers are the demand side and the main users of The Graph. They create and submit subgraphs to The Graph network and rely on it to serve the data their applications require.
The Graph has now moved to a fully decentralized subgraph hosting service, with economic incentives circulating between different participants to ensure the operation of the system:
Index Node Rewards: Index nodes earn revenue from consumers’ query fees and a portion of GRT token block rewards.
Delegator Rewards: Delegators receive a portion of the rewards from the index nodes they support.
Curator Rewards: If curators signal a valuable subgraph, they can receive a portion of the reward from query fees.
In fact, The Graph's products are also developing rapidly in the AI wave. As one of the core development teams of The Graph ecosystem, Semiotic Labs has been committed to using AI technology to optimize index pricing and user query experience. Currently, the AutoAgora, Allocation Optimizer and AgentC tools developed by Semiotic Labs have improved the performance of the ecosystem in many aspects.
AutoAgora introduces a dynamic pricing mechanism to adjust prices in real time based on query volume and resource usage, optimize pricing strategies, and ensure indexers’ competitiveness and maximize revenue.
Allocation Optimizer solves the complex problem of subgraph resource allocation, helping indexers achieve optimal resource allocation to improve revenue and performance.
AgentC is an experimental tool that allows users to access The Graph’s blockchain data through natural language, thereby improving the user experience.
The application of these tools enables The Graph to further improve the intelligence and user-friendliness of the system with the help of AI.
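To make the dynamic-pricing idea tangible, here is a deliberately simplified sketch of the kind of feedback loop AutoAgora automates: raise the query price when demand nears capacity, lower it when servers sit idle. The update rule, thresholds, and numbers are invented for illustration and are not AutoAgora's actual model.

```python
# Toy dynamic query pricing of the kind AutoAgora automates: price
# responds to utilization. Thresholds and step size are invented for
# illustration, not AutoAgora's actual algorithm.
def adjust_price(price: float, query_rate: float, capacity: float,
                 step: float = 0.1) -> float:
    utilization = query_rate / capacity
    if utilization > 0.9:          # near saturation: charge more
        return price * (1 + step)
    if utilization < 0.5:          # underused: compete on price
        return max(price * (1 - step), 0.0001)
    return price                   # comfortable load: hold steady

price = adjust_price(0.001, query_rate=95, capacity=100)  # busy: price rises
print(round(price, 6))  # 0.0011
```

Run continuously against real query volume and resource metrics, a loop like this keeps an indexer's pricing competitive without manual tuning, which is the behavior the tools above aim for.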
3.2 Chainbase
Chainbase is a full-chain data network that integrates all blockchain data into one platform, making it easier for developers to build and maintain applications. Its unique features include:
Real-time Data Lake: Chainbase provides a real-time data lake dedicated to blockchain data streams, making data instantly accessible as it is generated.
Dual-chain architecture: Chainbase built an execution layer based on Eigenlayer AVS, forming a parallel dual-chain architecture with the CometBFT consensus algorithm. This design enhances the programmability and composability of cross-chain data, supports high throughput, low latency and finality, and improves network security through a dual-staking model.
Innovative data format standards: Chainbase introduced a new data format standard called "manuscripts", which optimizes the way data is structured and utilized in the crypto industry.
Crypto World Model: With its vast blockchain data resources, Chainbase combined AI model technology to create an AI model that can effectively understand, predict and interact with blockchain transactions. The basic version of the model, Theia, has been launched for public use.
These features make Chainbase stand out among blockchain indexing protocols, with a particular focus on accessibility to real-time data, innovative data formats, and creating smarter models to improve insights by combining on-chain and off-chain data.
Chainbase's AI model Theia is the key highlight that distinguishes it from other data service protocols. Based on the DORA model developed by NVIDIA, Theia combines on-chain and off-chain data and spatiotemporal activities to learn and analyze cryptographic patterns and respond through causal reasoning, thereby deeply exploring the potential value and laws of on-chain data and providing users with more intelligent data services.
AI-enabled data services make Chainbase no longer just a blockchain data service platform, but a more competitive intelligent data service provider. Through powerful data resources and AI's proactive analysis, Chainbase is able to provide broader data insights and optimize users' data processing.
3.3 Space and Time
Space and Time (SxT) aims to build a verifiable computing layer and expand zero-knowledge proofs on decentralized data warehouses to provide trusted data processing for smart contracts, large language models, and enterprises. Currently, Space and Time has received $20 million in the latest round of Series A financing, led by Framework Ventures, Lightspeed Faction, Arrington Capital, and Hivemind Capital.
In the field of data indexing and verification, Space and Time has introduced a new technical path: Proof of SQL. This is an innovative zero-knowledge proof (ZKP) technology developed by Space and Time to ensure that SQL queries executed on its decentralized data warehouse are tamper-proof and verifiable. When a query runs, Proof of SQL generates a cryptographic proof attesting to the integrity and accuracy of the results. The proof is attached to the query results, allowing any verifier (such as a smart contract) to independently confirm that the data was not tampered with during processing.

Traditional blockchain networks usually rely on consensus mechanisms to verify data. Space and Time's Proof of SQL enables a more efficient approach: in its system, one node is responsible for acquiring data, while other nodes verify its authenticity using ZK technology. This replaces the resource cost of having multiple nodes repeatedly index the same data under a consensus mechanism until agreement is reached, improving overall system performance. As the technology matures, it lays a foundation for traditional, data-reliability-focused industries to build products on blockchain data.
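The verification pattern this enables can be illustrated schematically: one node executes the query and attaches a proof, and verifiers check the proof instead of re-running the query against their own copy of the data. In this greatly simplified sketch a plain hash commitment stands in for the real ZK proof, which is cryptographically far stronger (it also proves the query was *executed* correctly, not merely that the result was not altered afterward).

```python
import hashlib
import json

# Greatly simplified stand-in for the Proof of SQL pattern: the executing
# node returns (result, proof); verifiers check the proof rather than
# recomputing the query. A hash commitment replaces the actual ZK proof
# and only illustrates the message flow, not the cryptography.
def execute_with_proof(rows: list, min_value: int):
    result = [r for r in rows if r["value"] >= min_value]  # the "SQL query"
    blob = json.dumps(result, sort_keys=True).encode()
    return result, hashlib.sha256(blob).hexdigest()

def verify(result: list, proof: str) -> bool:
    blob = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest() == proof

rows = [{"value": 3}, {"value": 10}]
result, proof = execute_with_proof(rows, min_value=5)
print(verify(result, proof))  # True; any tampering with result fails
```

The key architectural point survives the simplification: verification is much cheaper than re-execution, so only one node needs to do the heavy indexing work.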
At the same time, SxT has been working closely with the Microsoft AI Joint Innovation Lab to accelerate the development of generative AI tools that make it easier for users to work with blockchain data through natural language. In Space and Time Studio today, users can type a natural-language query, and the AI automatically converts it into SQL and executes the statement on the user's behalf to present the final results.
3.4 Comparison of differences
4 Conclusion and Outlook
In summary, blockchain data indexing technology has undergone a gradual improvement process from the initial node data source, through the development of data parsing and indexers, to the AI-enabled full-chain data service. The continuous evolution of these technologies has not only improved the efficiency and accuracy of data access, but also brought unprecedented intelligent experience to users.
Looking ahead, with the continuous development of new technologies such as AI and zero-knowledge proof, blockchain data services will become more intelligent and secure. We have reason to believe that blockchain data services will continue to play an important role as infrastructure in the future, providing strong support for the industry's progress and innovation.