Written by Geng Kai, DFG

The importance of data in blockchain

Data is key to blockchain technology and is fundamental to the development of decentralized applications (dApps). While much of the current discussion revolves around data availability (DA) — ensuring that every network participant has access to recent transaction data for verification — there is an equally important aspect that is often overlooked: data accessibility.

DA solutions have become indispensable in the era of modular blockchains. These solutions ensure that transaction data is available to all participants, enabling real-time verification and maintaining the integrity of the network. However, the DA layer functions more like a billboard than a database. This means that data is not stored indefinitely; it is deleted over time, just as posters on a billboard are eventually replaced with new ones.

Data accessibility, on the other hand, focuses on the ability to retrieve historical data, which is essential for developing dApps and conducting blockchain analytics. This aspect is critical for tasks that require access to past data to ensure accurate representation and execution. Although data accessibility is discussed far less often, it is just as important as data availability. The two play different but complementary roles in the blockchain ecosystem, and a comprehensive data management approach must address both to support powerful and efficient blockchain applications.

How blockchain data was previously retrieved

Since its inception, blockchain has revolutionized infrastructure and enabled the creation of decentralized applications (dApps) in various fields such as gaming, finance, and social networking. However, building these dApps requires access to large amounts of blockchain data, which is difficult and expensive.

One option for dApp developers is to host and run their own archive RPC nodes. These nodes store all historical blockchain data from the genesis block onward, allowing full access to the data. However, archive nodes are expensive to maintain and have limited query capabilities, making it difficult to query data in the format that developers need. Running cheaper, non-archive nodes is an option, but these nodes have limited data retrieval capabilities, which may hinder the operation of dApps.

Another approach is to use a commercial RPC (Remote Procedure Call) node provider. These providers are responsible for the cost and management of the nodes and provide data through RPC endpoints. Public RPC endpoints are free but rate-limited, which may negatively impact the user experience of a dApp. Private RPC endpoints provide better performance by reducing congestion, but even simple data retrieval requires a lot of back-and-forth communication. This makes them request-heavy and inefficient for complex data queries. Additionally, private RPC endpoints are often difficult to scale and lack compatibility across different networks.
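To make the "request-heavy" point concrete, here is a minimal sketch that counts the round trips needed to find all transfers to one address by walking blocks and receipts. The `MockRpc` class, its `get_block` and `get_receipt` methods, and the toy chain data are hypothetical stand-ins invented for illustration; they are not the API of any real RPC provider.

```python
from dataclasses import dataclass


@dataclass
class MockRpc:
    """Hypothetical in-memory stand-in for an RPC endpoint (not a real client)."""
    blocks: dict      # block number -> list of transaction hashes
    receipts: dict    # transaction hash -> list of (token, sender, recipient, amount)
    calls: int = 0    # number of simulated round trips

    def get_block(self, number):
        self.calls += 1
        return self.blocks[number]

    def get_receipt(self, tx_hash):
        self.calls += 1
        return self.receipts[tx_hash]


def transfers_to(rpc, address, first_block, last_block):
    """Collect transfers to `address` by walking every block and every receipt."""
    found = []
    for number in range(first_block, last_block + 1):
        for tx_hash in rpc.get_block(number):        # one call per block
            for event in rpc.get_receipt(tx_hash):   # one call per transaction
                if event[2] == address:
                    found.append(event)
    return found


# Toy chain: 100 blocks with 2 transactions each.
# Transfers in every tenth block go to "alice", the rest go to "carol".
blocks = {n: [f"tx{n}-{i}" for i in range(2)] for n in range(100)}
receipts = {
    tx: [("USDC", "bob", "alice" if n % 10 == 0 else "carol", 5)]
    for n, txs in blocks.items()
    for tx in txs
}

rpc = MockRpc(blocks, receipts)
hits = transfers_to(rpc, "alice", 0, 99)
print(len(hits), "transfers found using", rpc.calls, "round trips")
```

In this toy setup, 300 simulated calls are needed to surface 20 matching transfers, and the call count grows with the size of the scanned range rather than with the size of the answer.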

A better alternative: blockchain indexers

Blockchain indexers play a vital role in organizing on-chain data and loading it into a database for easy querying, which is why they are often referred to as the "Google of blockchain." They work by indexing blockchain data and making it readily available through a structured, SQL-like query interface, typically exposed through APIs such as GraphQL. By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need using a standardized query language, greatly simplifying the process.
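The core idea of loading on-chain data into a database and then querying it can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module and plain SQL as a stand-in for the GraphQL APIs that indexers typically expose; the `transfers` table, its columns, and the sample events are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical decoded transfer events: (block, token, sender, recipient, amount).
EVENTS = [
    (1, "USDC", "bob", "alice", 50),
    (1, "DAI", "carol", "bob", 20),
    (2, "USDC", "alice", "carol", 30),
    (3, "USDC", "bob", "alice", 25),
]


def build_index(events):
    """Load events into an in-memory table with an index on the recipient column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers ("
        "block INTEGER, token TEXT, sender TEXT, recipient TEXT, amount INTEGER)"
    )
    db.execute("CREATE INDEX idx_recipient ON transfers (recipient)")
    db.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?, ?)", events)
    return db


db = build_index(EVENTS)

# One declarative query replaces a manual scan over blocks and receipts.
rows = db.execute(
    "SELECT token, SUM(amount) FROM transfers "
    "WHERE recipient = ? GROUP BY token ORDER BY token",
    ("alice",),
).fetchall()
print(rows)
```

Once the data is indexed, filtering and aggregation happen in a single query instead of many sequential requests.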

Different types of indexers optimize data retrieval in various ways:

  1. Full node indexers: These indexers run full blockchain nodes and extract data directly from them, ensuring that the data is complete and accurate, but requiring a lot of storage and processing power.

  2. Lightweight indexers: These indexers rely on full nodes to fetch specific data as needed, reducing storage requirements but potentially increasing query times.

  3. Specialized indexers: These indexers specialize in certain types of data or specific blockchains, optimizing retrieval for specific use cases, such as NFT data or DeFi transactions.

  4. Aggregate indexers: These indexers extract data from multiple blockchains and sources, including off-chain information, providing a unified query interface, which is particularly useful for multi-chain dApps.

An Erigon archive node for Ethereum alone requires about 3TB of storage space, and this requirement will keep increasing as the blockchain continues to grow. Indexer protocols deploy multiple indexers to index and query large amounts of data efficiently and at high speed, which is not feasible with RPC alone.

Indexers also allow for complex queries, easy filtering of data based on different criteria, and analysis of data after extraction. Some indexers also allow for the aggregation of data from multiple sources, thus avoiding the need to deploy multiple APIs in multi-chain dApps. By being distributed across multiple nodes, indexers provide enhanced security and performance, whereas RPC providers may experience outages and downtime due to their centralized nature.

Overall, compared to RPC node providers, indexers improve the efficiency and reliability of data retrieval while also costing less than deploying a node of one's own. This makes blockchain indexer protocols the first choice for many dApp developers.

Indexer use cases

As mentioned earlier, building dApps requires retrieving and reading blockchain data in order to run their services. This includes any type of dApp, including DeFi, NFT platforms, games, and even social networks, as these platforms need to read data before they can perform other transactions.

DeFi

DeFi protocols require different information in order to quote specific prices, rates, fees, etc. to users. Automated market makers (AMMs) require price and liquidity information about certain pools to calculate swap rates, while lending protocols require utilization rates to determine lending rates and debt ratios for liquidations. This information must be fed into the dApp before it can calculate the rates at which users' transactions are executed.
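As a rough illustration of why these inputs matter, the sketch below computes a swap quote from pool reserves and a borrow rate from a utilization ratio. It assumes a constant-product pool (x * y = k) and a simple linear interest rate model; both are common textbook choices rather than the design of any protocol named in this article, and all parameter values are hypothetical.

```python
def swap_quote(reserve_in, reserve_out, amount_in, fee=0.003):
    """Output amount for a constant-product pool after a proportional input fee."""
    amount_in_after_fee = amount_in * (1 - fee)
    return reserve_out * amount_in_after_fee / (reserve_in + amount_in_after_fee)


def utilization(total_borrowed, total_supplied):
    """Share of supplied liquidity that is currently borrowed."""
    if total_supplied == 0:
        return 0.0
    return total_borrowed / total_supplied


def borrow_rate(util, base_rate=0.02, slope=0.10):
    """Hypothetical linear rate model: the rate rises with utilization."""
    return base_rate + slope * util


# Indexed pool state (hypothetical numbers): 1,000 ETH against 2,000,000 USDC.
quote = swap_quote(reserve_in=1_000.0, reserve_out=2_000_000.0, amount_in=1.0)
util = utilization(total_borrowed=600.0, total_supplied=1_000.0)
rate = borrow_rate(util)
print(f"1 ETH -> {quote:.2f} USDC, utilization {util:.0%}, borrow rate {rate:.2%}")
```

Every input here (reserves, borrowed and supplied totals) is on-chain state that the dApp has to read before it can show a number to the user, which is where an indexer comes in.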

Gaming

GameFi needs to index and access data quickly to ensure smooth gameplay for users. Only with lightning-fast data retrieval and execution can Web3 games rival Web2 games in performance and attract more users. These games require data on land ownership, in-game token balances, in-game actions, and more. With an indexer, they can better maintain a steady stream of data and consistent uptime, delivering a seamless gaming experience.

NFT

NFT marketplaces and lending platforms need indexed data to access a variety of information, such as NFT metadata, ownership and transfer data, royalty information, etc. Quickly indexing this data avoids browsing through each NFT one by one to find ownership or NFT property data.
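A minimal sketch of the idea: replay transfer events once to build lookup tables, so that ownership questions become dictionary lookups instead of per-token checks. The event tuples and names below are hypothetical and do not follow any particular token standard's event layout.

```python
def build_ownership_index(transfers):
    """Replay (token_id, sender, recipient) events in order.

    Returns two lookups: token_id -> current owner, and owner -> set of token ids.
    A sender of None marks a mint.
    """
    owner_of = {}
    tokens_of = {}
    for token_id, sender, recipient in transfers:
        if sender is not None:
            tokens_of.get(sender, set()).discard(token_id)
        owner_of[token_id] = recipient
        tokens_of.setdefault(recipient, set()).add(token_id)
    return owner_of, tokens_of


# Hypothetical event history for a small collection.
TRANSFERS = [
    (1, None, "alice"),     # mint token 1 to alice
    (2, None, "bob"),       # mint token 2 to bob
    (1, "alice", "bob"),    # alice sends token 1 to bob
    (3, None, "alice"),     # mint token 3 to alice
]

owner_of, tokens_of = build_ownership_index(TRANSFERS)
print(owner_of[1], sorted(tokens_of["bob"]), sorted(tokens_of["alice"]))
```

After the replay, "who owns token 1" and "which tokens does bob hold" are both answered without touching the individual NFTs again.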

Whether it’s a DeFi automated market maker (AMM) that needs price and liquidity information, or a SocialFi app that needs to be updated with new user posts, being able to retrieve data quickly is critical for dApps to function properly. With indexers, they can retrieve data efficiently and correctly, providing a smooth user experience.

Analytics

Indexers provide a way to extract specific data from raw blockchain data, including smart contract events in each block. This opens up opportunities for more specific data analysis, providing comprehensive insights.

For example, a perpetual trading protocol can find out which tokens have high trading volume and which tokens generate the most fees, and decide whether to list these tokens as perpetual contracts on its platform. DEX developers can create dashboards for their own products and gain insights into which pools have the highest returns or the most liquidity. It is also possible to create public dashboards, giving developers the freedom and flexibility to query any type of data they want to display on a chart.
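The kind of aggregation behind such a dashboard can be sketched as follows. The swap records, token names, and figures are invented for illustration; a real pipeline would read them from an indexer's output.

```python
from collections import defaultdict

# Hypothetical indexed swap events: (token, volume_usd, fee_usd).
SWAPS = [
    ("ETH", 1000.0, 3.0),
    ("WBTC", 500.0, 1.5),
    ("ETH", 2500.0, 7.5),
    ("PEPE", 200.0, 0.6),
]


def summarize(swaps):
    """Total volume and fees per token."""
    totals = defaultdict(lambda: {"volume": 0.0, "fees": 0.0})
    for token, volume, fee in swaps:
        totals[token]["volume"] += volume
        totals[token]["fees"] += fee
    return dict(totals)


def top_by_volume(totals, n):
    """The n tokens with the highest traded volume, largest first."""
    return sorted(totals, key=lambda token: totals[token]["volume"], reverse=True)[:n]


totals = summarize(SWAPS)
print(top_by_volume(totals, 2), totals["ETH"])
```

A listing decision like the one described above would then compare these totals against whatever thresholds the protocol chooses.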

With multiple blockchain indexers available, identifying the differences between indexing protocols is critical to ensuring developers choose the indexer that best suits their needs.

Blockchain Indexer Overview


The Graph

The Graph is the first indexer protocol launched on Ethereum that makes it easy to query transaction data that was previously not easily accessible. It uses subgraphs to define and filter subsets of data collected from the blockchain, such as all transactions related to the Uniswap v3 USDC/ETH pool.

Using Proof of Indexing, Indexers stake the network's native token, GRT, to provide indexing and querying services, and Delegators can choose to delegate their tokens to Indexers. Curators signal on high-quality subgraphs, helping Indexers determine which subgraphs to index in order to earn the best query fees. In its transition to greater decentralization, The Graph will eventually sunset its hosted service and require subgraphs to upgrade to its decentralized network, while providing upgrade Indexers.

Its infrastructure enables an average cost of $40 per million queries, which is much lower than the cost of self-hosted nodes. Using file data sources, it also supports parallel indexing of on-chain and off-chain data at the same time for efficient data retrieval.

Indexer rewards at The Graph have been growing steadily over the past few quarters. This is partly due to the increase in query volume, but also due to growth in the token price, as the project plans to integrate AI-assisted querying in the future.

Subsquid

Subsquid is a peer-to-peer, horizontally scalable decentralized data lake that efficiently aggregates large amounts of on-chain and off-chain data and protects it with zero-knowledge proofs. As a decentralized worker network, each node is responsible for storing data from a specific subset of blocks, speeding up the data retrieval process by quickly identifying the nodes that hold the required data.
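The block-range idea can be illustrated with a small directory that maps fixed-size ranges of blocks to the worker holding them. This is only a sketch of the general technique: the `ChunkDirectory` class, the fixed chunk size, and the round-robin assignment are assumptions made for the example, not a description of how Subsquid actually assigns or locates data.

```python
from bisect import bisect_right


class ChunkDirectory:
    """Maps fixed-size block ranges (chunks) to the worker that stores them."""

    def __init__(self, chunk_size, workers, last_block):
        self.starts = list(range(0, last_block + 1, chunk_size))
        # Round-robin assignment is a hypothetical policy chosen for simplicity.
        self.owner = [workers[i % len(workers)] for i in range(len(self.starts))]

    def _chunk_index(self, block):
        if block < 0:
            raise ValueError("block number must be non-negative")
        # Binary search for the last chunk that starts at or before `block`.
        return bisect_right(self.starts, block) - 1

    def worker_for(self, block):
        """Worker holding the chunk that contains `block`."""
        return self.owner[self._chunk_index(block)]

    def workers_for_range(self, first, last):
        """(chunk start, worker) pairs covering the blocks first..last."""
        lo, hi = self._chunk_index(first), self._chunk_index(last)
        return [(self.starts[i], self.owner[i]) for i in range(lo, hi + 1)]


directory = ChunkDirectory(chunk_size=1000, workers=["w0", "w1", "w2"], last_block=9999)
print(directory.worker_for(2500))
print(directory.workers_for_range(1500, 3200))
```

A query for a block range only needs to contact the workers returned by the lookup, rather than asking every node in the network.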

Subsquid also supports real-time indexing, allowing blocks to be indexed before they are finalized. It also supports storing data in a format of the developer's choice, making analysis easier with tools and formats such as BigQuery, Parquet, or CSV. In addition, subgraphs can be deployed on the Subsquid network without migrating to the Squid SDK, enabling deployment with no code changes.

During its testnet phase, Subsquid achieved impressive statistics, with over 80,000 testnet users, over 60,000 Squid indexers deployed, and over 20,000 verified developers on the network. Most recently, on June 3, Subsquid launched the mainnet of its data lake.

In addition to indexing, the Subsquid Network data lake can also replace RPC in use cases such as analytics, ZK/TEE coprocessors, AI agents, and oracles.

SubQuery

SubQuery is a decentralized middleware infrastructure network that provides RPC and indexed data services. It initially supported Polkadot and Substrate networks and has since expanded to more than 200 chains. It works similarly to The Graph, using Proof of Indexing: indexers index data and serve query requests, and delegators stake their tokens with indexers. However, instead of curators, it introduces consumers, who submit purchase orders that signal guaranteed income to indexers.

It plans to introduce sharding-enabled SubQuery data nodes so that each node no longer has to continuously synchronize all new data, thereby optimizing query efficiency while moving toward greater decentralization. Users can choose to pay a computation fee of approximately 1 SQT token per 1,000 requests, or set a custom fee for the indexer through the protocol.

Although SubQuery only launched its token earlier this year, issuance rewards for nodes and delegators have grown month-over-month in USD value, reflecting the increasing volume of query services provided on its platform. Since the token generation event (TGE), the total amount of staked SQT has increased from 6 million to 125 million, highlighting the growth in network participation.

Covalent

Covalent is a decentralized indexer network in which Block Specimen Producer (BSP) nodes create copies of blockchain data through bulk export and publish proofs on the Covalent L1 blockchain. This data is then refined by Block Results Producer (BRP) nodes, which apply predefined rules to select the data that meets the requirements.

Through a unified API, developers can easily extract relevant blockchain data in a consistent request and response format, without having to write complex custom queries to access data. Developers pay network operators for these pre-configured data sets in CQT tokens, with settlement on Moonbeam.

Covalent rewards appear to have generally increased from Q1’23 to Q1’24, in part due to an increase in the price of Covalent’s token, CQT.

Considerations for choosing an indexer

Customizability of data

Some indexers, such as Covalent, are general-purpose indexers that only provide standard, pre-configured datasets through an API. While they may be fast, they do not provide flexibility for developers who need custom datasets. Indexer frameworks, by contrast, allow more customized data processing to meet application-specific needs.

Security

Indexed data must be secure; otherwise, dApps built on these indexers are also vulnerable to attack. For example, if transactions and wallet balances can be manipulated, there is a risk that the dApp will lose liquidity, which will affect its users. While all indexers employ some form of security through token staking by indexers, some solutions also use proofs to further increase security.

Subsquid offers the option to use optimistic and zero-knowledge proofs, while Covalent publishes proofs that include block hashes. The Graph provides a dispute period in the form of an optimistic challenge window for Indexer queries, while SubQuery generates a Merkle Mountain Range proof for each block, computing a hash over all the data stored in its database for that block.
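The "Merkle Mountain" construction mentioned above is commonly known as a Merkle Mountain Range (MMR): an append-only structure that commits to a growing sequence of items by keeping only the roots ("peaks") of perfect binary trees. The sketch below is a generic, simplified accumulator that stores only the peaks; the hash function, domain prefixes, and peak-bagging order are assumptions for illustration and are not taken from SubQuery's implementation.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MerkleMountainRange:
    """Append-only accumulator that keeps only the peaks of its perfect subtrees."""

    def __init__(self):
        self.peaks = []   # list of (height, digest), tallest peak first
        self.size = 0     # number of leaves appended so far

    def append(self, leaf: bytes):
        node = (0, _h(b"\x00" + leaf))
        # Two peaks of equal height merge into one peak that is one level taller.
        while self.peaks and self.peaks[-1][0] == node[0]:
            height, left = self.peaks.pop()
            node = (height + 1, _h(b"\x01" + left + node[1]))
        self.peaks.append(node)
        self.size += 1

    def root(self) -> bytes:
        """Combine ("bag") the peaks from right to left into a single digest."""
        if not self.peaks:
            return _h(b"")
        digest = self.peaks[-1][1]
        for _, peak in reversed(self.peaks[:-1]):
            digest = _h(b"\x01" + peak + digest)
        return digest


def commit(blocks):
    """Build an accumulator over a sequence of per-block data blobs."""
    mmr = MerkleMountainRange()
    for data in blocks:
        mmr.append(data)
    return mmr


blocks = [f"block-{i}".encode() for i in range(5)]
honest = commit(blocks)
tampered = commit(blocks[:2] + [b"forged"] + blocks[3:])
print([height for height, _ in honest.peaks])
print(honest.root() != tampered.root())
```

With five leaves the structure holds two peaks (heights 2 and 0), and changing the data of any single block changes the resulting root, which is what lets a verifier detect that indexed data was altered.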

Speed and Scalability

As blockchains continue to grow and transaction volumes increase, indexing large amounts of data becomes more cumbersome because more processing power and storage space are required. Maintaining efficiency becomes harder as networks grow, but indexer protocols introduce solutions to meet these growing demands.

For example, Subsquid achieves horizontal scalability by adding more nodes to store data, and it can also scale as hardware improves. The Graph provides parallel data streaming to synchronize data faster, while SubQuery introduces node sharding to speed up the synchronization process.
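The general pattern behind these approaches, splitting a block range into shards and processing the shards concurrently, can be sketched with Python's standard library. The per-shard work here is a placeholder, and the shard count and helper names are hypothetical; this is not how any of the protocols above implement synchronization.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(first, last, shards):
    """Split the blocks first..last into contiguous, non-overlapping ranges."""
    total = last - first + 1
    size = -(-total // shards)  # ceiling division
    return [(start, min(start + size - 1, last)) for start in range(first, last + 1, size)]


def index_range(block_range):
    """Placeholder per-shard work: count the even-numbered blocks in the range."""
    start, end = block_range
    return sum(1 for number in range(start, end + 1) if number % 2 == 0)


def parallel_index(first, last, shards):
    """Process each shard in its own thread and merge the partial results."""
    ranges = split_range(first, last, shards)
    with ThreadPoolExecutor(max_workers=shards) as pool:
        return sum(pool.map(index_range, ranges))


print(split_range(0, 99, 4))
print(parallel_index(0, 99, 4))
```

Threads mainly help when the per-shard work is dominated by waiting on network or disk, as is typical when fetching block data; CPU-bound processing would need processes or separate machines instead.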

Supported Networks

While most blockchain activity still happens on Ethereum, other blockchains are growing in popularity over time. For example, Layer 2s, Solana, Move-based blockchains, and Bitcoin ecosystem chains all have their own growing sets of developers and activity, which also require indexing services.

Supporting chains that other indexer protocols do not support can capture more market share and fees. Indexing data-intensive networks such as Solana is not easy, and so far only Subsquid has successfully provided indexing support for it.

Conclusion

Despite their widespread adoption in dApp development, the potential of Indexers remains huge, especially with the integration of AI. As AI continues to gain popularity in Web2 and Web3, its ability to improve depends on access to relevant data to train models and develop AI agents. Ensuring data integrity is critical for AI applications because it prevents models from being fed biased or inaccurate information.

In the area of indexer solutions, Subsquid has made significant progress in performance and user metrics. Users have begun experimenting with building AI agents with Subsquid, demonstrating the platform’s versatility and potential in the evolving data indexing space. In addition, tools such as AutoAgora help indexers use AI to provide dynamic pricing for query services on The Graph, while SubQuery supports multiple AI networks such as OriginTrail and Oraichain for transparent data indexing.

The integration of AI with indexers promises to enhance data accessibility and usability in the blockchain ecosystem. By leveraging AI technology, indexers can provide more efficient and accurate data retrieval, enabling developers to build more sophisticated dApps and analytical tools. As AI and indexers continue to advance together, we remain optimistic about the future of data indexing and its role in shaping the decentralized digital landscape.