Author: Jerry Luo, Kernel Ventures

Reviewed by: Mandy and Joshua, Kernel Ventures

TLDR:

  1. In the early days, public chains required all nodes in the entire network to maintain data consistency to ensure security and decentralization. However, with the development of the blockchain ecosystem, storage pressure has continued to increase, leading to a trend of centralization in node operations. At this stage, Layer1 urgently needs to solve the storage cost problem caused by the increase in TPS.

  2. Faced with this problem, developers need to propose new historical data storage solutions while taking into account security, storage costs, data reading speed and DA layer versatility.

  3. In the process of solving this problem, many new technologies and new ideas have emerged, including Sharding, DAS, Verkle Tree, DA intermediate components, etc. They try to optimize the storage solution of the DA layer by reducing data redundancy and improving data verification efficiency.

  4. At present, DA solutions fall roughly into two categories based on where the data is stored: main chain DA and third-party DA. Main chain DA reduces node storage pressure through periodic data pruning and sharded data storage. Third-party DA designs are built around storage services and offer reasonable solutions for large volumes of data, so the main trade-off there is between single-chain and multi-chain compatibility, leading to three approaches: main-chain-dedicated DA, modular DA, and storage public chain DA.

  5. Payment-oriented public chains have extremely high requirements for historical data security and are suited to using the main chain itself as the DA layer. For public chains that have been running for a long time and have a large miner base, it is more appropriate to adopt a third-party DA that does not touch the consensus layer while still accounting for security. Comprehensive public chains are better served by main-chain-dedicated DA storage, which offers larger capacity and lower cost without sacrificing security; where cross-chain needs matter, modular DA is also a good option.

  6. In general, blockchain is moving towards reducing data redundancy and multi-chain division of labor.

1. Background

As a distributed ledger, a blockchain stores a copy of historical data on every node to ensure that data storage is secure and sufficiently decentralized. Since the correctness of each state change depends on the previous state (the source of a transaction), a blockchain should in principle store the complete history from the first transaction to the current one in order to guarantee transaction correctness. Taking Ethereum as an example, even estimating an average block size of 20 KB, the total size of Ethereum's blocks has already reached about 370 GB; a full node must additionally record state and transaction receipts, bringing the total storage of a single node to over 1 TB. This concentrates node operation in the hands of a few.

Ethereum's latest block height, image source: Etherscan

The recent Ethereum Cancun upgrade aims to raise Ethereum's TPS to around 1,000, at which point Ethereum's annual storage growth would exceed its current total storage. Among popular high-performance public chains, transaction speeds of tens of thousands of TPS could add hundreds of GB of data per day. The approach of full data redundancy across all nodes in the network clearly cannot keep up with such storage pressure, so Layer1 must find a suitable way to balance TPS growth against node storage costs.

2. DA performance indicators

2.1 Security

Compared with database or linked list storage structures, the immutability of blockchain comes from the fact that new data can be verified through historical data. Therefore, ensuring the security of historical data is the first issue to be considered in DA layer storage. When judging the data security of blockchain systems, we often analyze it from the perspective of the amount of data redundancy and the verification method of data availability.

  • Redundancy: Data redundancy in a blockchain system mainly plays the following roles. First, the more replicas there are in the network, the more samples a verifier can consult when it needs to check an account state in a historical block to verify a current transaction, allowing it to select the data recorded by the majority of nodes. In a traditional database, data is stored only on a single node as key-value pairs, so historical data can be altered on that single node alone and the attack cost is extremely low. In theory, the more replicas, the more reliable the data, and the more nodes that store it, the less likely it is to be lost. This can be compared to centralized servers hosting Web2 games: once all backend servers shut down, the game is gone entirely. However, more redundancy is not always better, since each replica consumes additional storage space and excessive redundancy places excessive storage pressure on the system. A good DA layer chooses a suitable redundancy scheme that balances security against storage efficiency.

  • Data availability verification: Redundancy ensures that enough copies of the data exist in the network, but the data to be used must also be verified for accuracy and completeness. The verification method commonly used in today's blockchains is the cryptographic commitment: a very small commitment, derived from the transaction data, is kept on record by the whole network. To verify the authenticity of a piece of historical data, one recomputes the commitment from that data and checks whether it matches the network's record; if it does, verification passes. Commonly used commitment schemes include the Merkle root and the Verkle root. A high-security data availability verification scheme needs only a small amount of verification data and can check historical data quickly.
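To make the commitment idea concrete, here is a minimal Merkle-root sketch in Python (toy code, not any production implementation): it derives a single root from a list of transactions, produces a proof for one leaf, and checks the proof against the root exactly as the verification step above describes.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    # Hash the leaves, then pair-wise hash upward until one root remains.
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    # Collect the sibling hash on each level along the path to the root.
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))  # (sibling, am-I-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    # Recompute the commitment from the leaf and compare with the recorded root.
    node = h(leaf)
    for sibling, is_right in proof:
        node = h(sibling + node) if is_right else h(node + sibling)
    return node == root
```

Note how verification touches only log-many sibling hashes rather than the whole data set, which is exactly why the network-wide record can stay tiny.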

2.2 Storage Cost

On the premise of basic security, the next core goal of the DA layer is to cut costs and improve efficiency, starting with storage cost: setting aside differences in hardware performance, this means reducing the space occupied per unit of stored data. At present, blockchains mainly reduce storage costs through sharding and through reward-based storage, which keeps data reliably stored while cutting the number of backups. It is not hard to see from these measures, however, that there is a trade-off between storage cost and data security: reducing storage usage usually means lower security, so an excellent DA layer must strike a balance between the two. In addition, if the DA layer is a separate public chain, it should also minimize the number of intermediate hops in data exchange: every relay leaves behind index data for later queries, so a longer call path means more index data and higher storage cost. Finally, storage cost is directly tied to data persistence: in general, the higher the storage cost, the harder it is for a public chain to store data persistently.

2.3 Data reading speed

After cutting costs, the next step is improving efficiency: the ability to quickly retrieve data from the DA layer when it is needed. This involves two steps. The first is locating the nodes that store the data; this mainly concerns chains that do not maintain full data consistency across the network, and if a chain synchronizes data across all nodes, this step's time cost is negligible. Second, mainstream blockchain systems today, including Bitcoin, Ethereum, and Filecoin, use the LevelDB database for node storage. In LevelDB, data is held in three forms. Data written in real time goes into a Memtable file; when the Memtable fills up, it becomes an Immutable Memtable, which can no longer be written, only read. Both types live in memory; the hot storage used in the IPFS network keeps data at this layer, so it can be read from memory quickly when called. But an ordinary node's RAM is typically only at the GB level, writes can easily stall, and if the node crashes or otherwise fails, data in memory is permanently lost. To persist data, it must be stored as SST files on a solid-state drive (SSD), but reading then requires loading the data into memory first, which greatly slows data indexing. Finally, in systems that use sharded storage, reconstructing data requires sending requests to multiple nodes and reassembling the pieces, which also reduces read speed.

LevelDB data storage method, image source: Leveldb-handbook
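The LevelDB write path described above can be sketched as a toy log-structured store (illustrative only; real LevelDB adds write-ahead logs, compaction, and bloom filters, and the class and parameter names here are invented for the sketch):

```python
class ToyLSMStore:
    """Toy sketch of LevelDB's write path: an in-memory memtable that is
    frozen and flushed to an (on-disk, here merely simulated) sorted SST
    snapshot once it fills up. Reads probe memory first, then SSTs
    newest-first, which is why SST reads are the slow path."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # mutable, in-memory
        self.memtable_limit = memtable_limit
        self.ssts = []                  # flushed, immutable sorted snapshots

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Freeze: the memtable becomes immutable and is written out.
            self.ssts.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # fast path: still in memory
            return self.memtable[key]
        for sst in reversed(self.ssts): # slow path: scan SSTs newest-first
            if key in sst:
                return sst[key]
        return None
```

A crash in this model would lose only `self.memtable`, mirroring the point above that in-memory data is volatile while SST files persist.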

2.4 DA layer universality

With the growth of DeFi and the recurring problems of CEXs, users' demand for cross-chain trading of decentralized assets keeps rising. Whether the cross-chain mechanism is hash locking, a notary scheme, or a relay chain, it inevitably requires confirming historical data on both chains at once. The crux of the problem is that the data of the two chains is kept separately, and different decentralized systems cannot communicate directly. A solution proposed at this stage is therefore to change how the DA layer stores data: store the historical data of multiple public chains on the same trusted public chain, so that verification only needs to read data from that single chain. This requires the DA layer to establish secure communication with different kinds of public chains, i.e., to have good universality.

3. Exploration of DA-related technologies

3.1 Sharding

  • In a traditional distributed system, a file is not stored in complete form on a single node. Instead, the original data is split into multiple Blocks, with one Block stored per node, and Blocks are usually backed up on other nodes as well; in mainstream distributed systems the backup count is typically set to 2. This sharding mechanism reduces the storage pressure on a single node, expands the system's total capacity to the sum of all nodes' storage, and keeps storage secure through moderate redundancy. The sharding schemes adopted in blockchains are broadly similar, but differ in the details. First, since every node in a blockchain is untrusted by default, implementing sharding requires a sufficiently large number of backups to support later judgments of data authenticity, so the backup count must be far more than 2. Ideally, in a blockchain using this storage scheme, if the total number of validating nodes is T and the number of shards is N, the backup count should be T/N. Second, the Block storage process differs. A traditional distributed system has few nodes, so one node is mapped to many data blocks: the data is first mapped onto a hash ring by consistent hashing, each node then stores the blocks numbered within a given range, and it is acceptable for a node to receive no storage task in a given round. On a blockchain, whether a node is assigned a Block is no longer a random event but a certainty: each node picks a Block to store by hashing the Block's data together with its own node information and taking the remainder modulo the number of shards. Assuming each piece of data is divided into N Blocks, each node's actual storage is only 1/N of the original; by choosing N appropriately, a balance can be struck between growing TPS and node storage pressure.

Data storage method after Sharding, Image source: Kernel Ventures
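The shard-assignment rule described above (hash the Block data together with the node's own identity, take the remainder modulo the shard count) can be sketched in a few lines of Python. This is an illustrative model, not any specific chain's implementation, and the hash-input layout is an assumption:

```python
import hashlib

def assigned_shard(node_id: str, block_data: str, num_shards: int) -> int:
    """Each node derives its shard index by hashing its own identity
    together with the data and taking the remainder modulo the shard
    count, so the assignment is deterministic per node but spreads
    roughly evenly across shards over many nodes."""
    digest = hashlib.sha256((node_id + block_data).encode()).digest()
    return int.from_bytes(digest, "big") % num_shards
```

With T nodes and N shards, each shard ends up held by about T/N nodes, matching the ideal backup count stated above.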

3.2 DASData Availability Sampling

DAS technology is a further optimization of sharded storage. In the sharding process, because nodes store Blocks by simple random selection, a Block could go missing entirely. In addition, for sharded data it matters how authenticity and integrity are confirmed during reconstruction. DAS solves these two problems with erasure coding and KZG polynomial commitments.

  • Erasure code: Given the huge number of validating nodes in Ethereum, the probability that a Block is stored by no node at all is nearly zero, but in theory this extreme case remains possible. To mitigate the threat of such loss, this scheme typically does not split the original data directly into Blocks for storage; instead, it first maps the original data onto the coefficients of a degree-n polynomial, takes 2n points on the polynomial, and lets nodes randomly pick from these for storage. Since only n+1 points are needed to reconstruct a degree-n polynomial, roughly half of the Blocks suffice to recover the original data. Erasure coding thus improves both the safety of data storage and the network's ability to recover data.
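A minimal erasure-coding sketch in Python, over a toy prime field (for readability this uses the common systematic variant, where the data symbols are the polynomial's values at the first n points rather than its coefficients; the recovery property is the same as described above):

```python
P = 2**31 - 1  # toy prime field for the arithmetic

def lagrange_eval(points, x):
    """Evaluate, at x, the unique polynomial through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data):
    """Systematic encoding: the n data symbols are the polynomial's values
    at x = 0..n-1; extend to 2n blocks by evaluating at x = n..2n-1."""
    pts = list(enumerate(data))
    return pts + [(x, lagrange_eval(pts, x))
                  for x in range(len(data), 2 * len(data))]

def recover(blocks, n):
    """Any n surviving blocks determine the polynomial; re-read the data
    symbols by evaluating it back at x = 0..n-1."""
    pts = blocks[:n]
    return [lagrange_eval(pts, x) for x in range(n)]
```

Losing any half of the 2n blocks is tolerable, which is exactly the resilience the paragraph above attributes to erasure coding.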

  • KZG polynomial commitment: A crucial part of data storage is verifying data authenticity. In networks that do not use erasure coding, various methods can serve this purpose, but once the erasure coding above is introduced for data safety, the more suitable method is the KZG polynomial commitment. The KZG commitment can verify the content of a single Block directly in polynomial form, eliminating the step of decoding the polynomial back into binary data. The overall shape of the verification resembles a Merkle tree, but it needs no path node data: the KZG root and the Block data alone suffice to verify the Block's authenticity.

3.3 DA layer data verification method

Data verification ensures that data retrieved from a node has neither been tampered with nor lost. To minimize the data volume and computational cost of verification, the DA layer currently relies on tree structures as the mainstream verification method. The simplest form is the Merkle tree, recorded as a complete binary tree: verification requires only the Merkle root plus the hash of the sibling subtree at each step along the node's path, giving O(log N) time complexity (log with no explicit base denotes log2(N)). Although this greatly simplifies verification, the amount of verification data still grows with the data size. To address this, another verification method, the Verkle tree, has been proposed. Each node in a Verkle tree stores not only its value but also a vector commitment; the authenticity of data can be verified quickly from the original node's value and this commitment proof, without fetching the values of sibling nodes. This makes the computation per verification depend only on the depth of the Verkle tree, a fixed constant, greatly speeding up verification. However, computing the vector commitment requires all sibling nodes in the same layer, which makes writing and modifying data much more expensive. For historical data that is stored permanently, cannot be tampered with, and is only read but never written, the Verkle tree is therefore extremely well suited. In addition, both Merkle trees and Verkle trees have K-ary variants with similar mechanisms, differing only in the number of subtrees under each node; the performance comparison appears in the table below.

Data verification method time performance comparison, image source: Verkle Trees
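The proof-size contrast above can be made concrete with a rough node-count sketch (counting proof elements only and ignoring that a hash and a vector commitment have different byte sizes; the arity values are illustrative):

```python
def tree_depth(n_leaves: int, arity: int) -> int:
    """Levels needed for an arity-way tree to cover n_leaves leaves
    (integer arithmetic to avoid floating-point log rounding)."""
    depth, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= arity
        depth += 1
    return depth

def merkle_proof_nodes(n_leaves: int, arity: int = 2) -> int:
    # A K-ary Merkle proof carries (arity - 1) sibling hashes per level.
    return (arity - 1) * tree_depth(n_leaves, arity)

def verkle_proof_nodes(n_leaves: int, arity: int = 256) -> int:
    # A Verkle proof carries one constant-size commitment per level,
    # independent of the arity, so it is just the tree depth.
    return tree_depth(n_leaves, arity)
```

This shows why widening a Merkle tree does not help (siblings multiply as levels shrink), while a wide Verkle tree keeps proofs at a handful of elements.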

3.4 General DA Middleware

The continuous expansion of the blockchain ecosystem has brought an ever-growing number of public chains. Because each public chain has its own advantages and irreplaceable role in its field, it is unlikely that Layer1 will consolidate into a single chain any time soon. Meanwhile, with the growth of DeFi and the recurring problems of CEXs, users' demand for decentralized cross-chain asset trading keeps rising. DA-layer multi-chain data storage, which can eliminate the security issues of cross-chain data interaction, has therefore attracted increasing attention. To accept historical data from different public chains, however, the DA layer needs to provide a decentralized protocol for standardized storage and verification of data streams. For example, kvye, a storage middleware built on Arweave, actively fetches data from each chain and stores all chain data on Arweave in a standardized form, minimizing differences in the data transmission process. By comparison, a Layer2 that provides DA storage specifically for one public chain interacts with data through internally shared nodes; this lowers interaction costs and improves security, but is considerably more limited, serving only that particular public chain.

4. DA layer storage solution

4.1 Main Chain DA

4.1.1 Danksharding-like solutions

This type of storage solution does not yet have a settled name; its most prominent representative is Danksharding on Ethereum, so this article refers to the category as Danksharding-like solutions. These solutions combine the two DA storage technologies above, sharding and DAS. The data is first divided into an appropriate number of shards, and each node then draws one data Block to store in the DAS manner. With enough nodes in the network, a larger shard count N can be chosen, so that each node's storage pressure drops to 1/N of the original, expanding overall storage capacity N-fold. To guard against the extreme case of some Block being stored by no node at all, Danksharding encodes the data with erasure coding, under which only half of the data is needed for full reconstruction. Finally, the verification process uses a Verkle tree structure with polynomial commitments for fast verification.

4.1.2 Short-term storage

For main chain DA, one of the simplest ways to handle data is to store historical data only for a short period. In essence, the blockchain acts as a public ledger whose content changes under the joint witness of the whole network, and permanent storage is not strictly required. Taking Solana as an example, although its historical data is synchronized to Arweave, mainnet nodes retain only the past two days of transaction data. On account-based public chains, the state at any moment retains the final status of every account, which is sufficient as the verification basis for the next state change. Project parties with special needs for data older than this window can store it on other decentralized public chains or with a trusted third party; in other words, those with extra data needs pay for the storage of historical data themselves.

4.2 Third-party DA

4.2.1 Mainchain-specific DA: EthStorage

  • Main chain dedicated DA: The DA layer's foremost concern is the security of data transmission, and in this respect the main chain's own DA is the most secure. However, main chain storage is constrained by storage space and resource competition, so when network data grows rapidly, a third-party DA becomes the better choice for long-term storage. The higher a third-party DA's compatibility with the mainnet, the more it can share nodes with it, and the more secure its data interaction will be. Under the premise of security, therefore, a main-chain-dedicated DA holds large advantages. Taking Ethereum as an example, a basic requirement for a main-chain-dedicated DA is EVM compatibility, guaranteeing interoperability with Ethereum data and contracts; representative projects include Topia and EthStorage. Among them, EthStorage is the most complete in terms of compatibility: beyond the EVM level, it also provides dedicated interfaces to Ethereum development tools such as Remix and Hardhat, achieving compatibility at the tooling level as well.

  • EthStorage: EthStorage is a public chain independent of Ethereum, but the nodes running it are a superset of Ethereum nodes: a node running EthStorage can simultaneously run Ethereum, and EthStorage can be operated directly through opcodes on Ethereum. In EthStorage's storage model, only a small amount of metadata is retained on the Ethereum mainnet for indexing, essentially creating a decentralized database for Ethereum. In the current design, EthStorage implements interaction between the Ethereum mainnet and EthStorage by deploying an EthStorage Contract on the mainnet. To store data, Ethereum calls the contract's put() function, whose input parameters are two bytes variables, key and data: data is the content to be stored, and key is its identifier in the Ethereum network, comparable to a CID in IPFS. Once the (key, data) pair has been stored in the EthStorage network, EthStorage generates a kvldx and returns it to the Ethereum mainnet, where it is paired with the key. This value corresponds to the data's storage address on EthStorage, so the problem of storing a large amount of data is reduced to storing a single (key, kvldx) pair, dramatically cutting the storage cost on the Ethereum mainnet. To retrieve previously stored data, one calls EthStorage's get() function with the key; the kvldx stored on Ethereum then allows the data to be located quickly on EthStorage.

EthStorage contract. Image source: Kernel Ventures
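The put()/get() flow described above can be modeled with a toy Python sketch (the real EthStorage contract interface may differ; the class and field names here are invented for illustration, and the kvldx is simplified to a list index):

```python
class EthStorageModel:
    """Toy model of the put()/get() flow: the mainnet keeps only a tiny
    (key -> kvldx) index entry, while the data body lives on the
    (here simulated) EthStorage network."""

    def __init__(self):
        self.mainnet_index = {}   # key -> kvldx, the only thing "on Ethereum"
        self.ethstorage = []      # kvldx -> data, held by EthStorage nodes

    def put(self, key: bytes, data: bytes) -> int:
        self.ethstorage.append(data)
        kvldx = len(self.ethstorage) - 1   # EthStorage returns the slot index
        self.mainnet_index[key] = kvldx
        return kvldx

    def get(self, key: bytes) -> bytes:
        kvldx = self.mainnet_index[key]    # cheap on-chain lookup
        return self.ethstorage[kvldx]      # body fetched from EthStorage
```

Whatever the size of `data`, the mainnet's footprint per object stays one small index pair, which is the cost reduction the paragraph above describes.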

  • As for how nodes store the data concretely, EthStorage borrows from the Arweave model. First, the large number of (k, v) pairs coming from Ethereum are sharded; each shard contains a fixed number of (k, v) pairs, and the size of each individual (k, v) pair is also bounded, which keeps the miners' workload fair in the subsequent storage-reward process. Issuing rewards requires verifying that a node really stores the data. In this process, EthStorage divides a shard (TB-scale in size) into many chunks and keeps a Merkle root on the Ethereum mainnet for verification. A miner must first provide a nonce, which, combined with the hash of the previous block on EthStorage, generates the addresses of several chunks through a random algorithm; the miner must then supply the data of those chunks to prove that it indeed stores the whole shard. The nonce cannot be chosen arbitrarily, otherwise a node could pick a nonce that happens to hit only the chunks it stores and pass verification. Therefore the nonce must be such that the chunks it generates, after being mixed and hashed, meet the network's difficulty requirement, and only the first node to submit a valid nonce and random-access proof earns the reward.
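The nonce-to-chunks sampling game described above can be sketched as follows (an illustrative model, not EthStorage's actual algorithm; the number of sampled chunks and the difficulty parameter are arbitrary choices for the sketch):

```python
import hashlib

def chunk_indices(prev_hash: bytes, nonce: int, num_chunks: int, k: int = 4):
    """Derive k pseudo-random chunk positions from the previous block hash
    and the miner's nonce; the miner must reveal exactly these chunks."""
    seed = prev_hash + nonce.to_bytes(8, "big")
    return [int.from_bytes(hashlib.sha256(seed + bytes([i])).digest(), "big")
            % num_chunks for i in range(k)]

def meets_difficulty(prev_hash: bytes, nonce: int, chunks, bits: int = 6) -> bool:
    """The nonce is valid only if hashing it together with the sampled chunk
    data clears the difficulty target; since the sampled positions depend on
    the nonce itself, a miner cannot cherry-pick a nonce that hits only the
    chunks it happens to hold."""
    mix = hashlib.sha256(prev_hash + nonce.to_bytes(8, "big")
                         + b"".join(chunks)).digest()
    return int.from_bytes(mix, "big") >> (256 - bits) == 0
```

A miner that stores the whole shard can grind nonces freely; a miner missing chunks will keep drawing positions it cannot serve, which is the incentive the scheme relies on.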

4.2.2 Modular DA: Celestia

  • Blockchain modules: At present, the work a Layer1 public chain performs divides into four main parts: (1) designing the network's underlying logic, selecting validating nodes in some manner, writing blocks, and distributing rewards to network maintainers; (2) packaging and processing transactions and publishing them; (3) verifying the transactions to be uploaded and determining the final state; (4) storing and maintaining the blockchain's historical data. According to these functions, the blockchain can be divided into four modules: the consensus layer, execution layer, settlement layer, and data availability layer (DA layer).

  • Modular blockchain design: For a long time these four modules were integrated in a single public chain, known as a monolithic blockchain. This form is stable and easy to maintain, but it puts enormous pressure on a single chain. In operation, the four modules constrain one another and compete for the chain's limited computing and storage resources: raising the execution layer's processing speed increases storage pressure on the data availability layer, while securing the execution layer demands a more complex verification mechanism, which slows transaction processing. The development of a public chain is therefore a constant trade-off among the four modules. To break through this performance bottleneck, developers proposed the modular blockchain: separate one or more of the four modules out and hand them to a dedicated public chain. That chain can then focus solely on transaction speed or storage capacity, escaping the cap that the weakest component places on overall performance.

  • Modular DA: Separating the DA layer from the rest of the blockchain's business and handing it to a dedicated public chain is considered a feasible answer to Layer1's growing historical data. Exploration here is still early, and the most representative project is Celestia. In terms of storage method, Celestia borrows from Danksharding: it likewise splits the data into many Blocks, with each node drawing a part to store, and uses KZG polynomial commitments to verify data integrity. On top of this, Celestia uses two-dimensional RS erasure coding, rewriting the original data as a k*k matrix so that only 25% of the encoded data is needed to recover the original. Sharded storage, however, essentially only scales the whole network's storage pressure by a constant factor of the total data volume: node storage pressure still grows linearly with the data. As Layer1 transaction speeds keep rising, node storage pressure could one day still hit an unacceptable threshold. To address this, Celestia introduces the IPLD component: the data of the k*k matrix is not stored on Celestia directly but in the LL-IPFS network, with nodes keeping only the data's IPFS CID. When a user requests a piece of historical data, the node sends the corresponding CID to the IPLD component, which retrieves the original data from IPFS by that CID; if the data exists on IPFS it is returned via the IPLD component and the node, and if not, it cannot be returned.

Celestia data reading method, image source: Celestia Core
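The two-dimensional extension and the 25% figure can be illustrated with a toy sketch: extend each row of a k*k data matrix by polynomial evaluation over a small prime field, then extend each column, giving a 2k*2k matrix in which the original data occupies k²/(2k)² = 25% (illustrative only; real 2D RS coding works over binary extension fields with far larger matrices):

```python
P = 2**31 - 1  # toy prime field

def _lagrange(points, x):
    """Evaluate, at x, the unique polynomial through the (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def extend_1d(symbols):
    """Double a row/column: keep the k data symbols (values at x = 0..k-1)
    and append k parity symbols (values at x = k..2k-1)."""
    pts = list(enumerate(symbols))
    return list(symbols) + [_lagrange(pts, x)
                            for x in range(len(symbols), 2 * len(symbols))]

def extend_2d(matrix):
    """Extend every row, then every column, of the k x k data square,
    producing the 2k x 2k matrix that DAS-style sampling draws from."""
    rows = [extend_1d(row) for row in matrix]            # k x 2k
    cols = [extend_1d(list(c)) for c in zip(*rows)]      # 2k x 2k, transposed
    return [list(r) for r in zip(*cols)]                 # transpose back
```

Because the code is systematic, the original square sits unchanged in the top-left quadrant, and any sufficiently large sample of the extended matrix determines the rest.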

  • Celestia: Taking Celestia as an example, we can glimpse how a modular blockchain can address Ethereum's storage problem. A Rollup node sends its packaged and verified transaction data to Celestia, which stores the data without interpreting it much; the Rollup node then pays Celestia a storage fee in TIA tokens proportional to the storage used. The storage on Celestia uses DAS and erasure coding similar to EIP-4844, but the one-dimensional polynomial erasure code of EIP-4844 is upgraded to a two-dimensional RS erasure code, raising storage security further: only 25% of the fragments are needed to recover the full transaction data. In essence, Celestia is simply a PoS public chain with low storage costs; to use it to solve Ethereum's historical storage problem, many other specific modules must work with it. For example, on the Rollup side, a mode strongly recommended on Celestia's official website is the Sovereign Rollup. Unlike a typical Layer2 Rollup, which only computes and verifies transactions, i.e., performs the execution layer's work, a Sovereign Rollup carries out the entire execution and settlement process, minimizing how much Celestia has to process each transaction; since Celestia's overall security is weaker than Ethereum's, this maximizes the security of the transaction process as a whole. As for the security of Ethereum mainnet reading data from Celestia, the mainstream solution at present is the Quantum Gravity Bridge smart contract: for data stored on Celestia, a Merkle root (data availability proof) is generated and kept in the Quantum Gravity Bridge contract on the Ethereum mainnet. Each time Ethereum calls historical data on Celestia, it compares the hash result against that Merkle root, and a match confirms that the data is indeed genuine historical data.

4.2.3 Storage public chain DA

In terms of the technical principle of the main chain DA, many technologies similar to Sharding are borrowed from the storage public chain. Among the third-party DAs, some directly use the storage public chain to complete part of the storage task. For example, the specific transaction data in Celestia is placed on the LL-IPFS network. In the third-party DA solution, in addition to building a separate public chain to solve the storage problem of Layer1, a more direct way is to directly connect the storage public chain with Layer1 to store the huge historical data on Layer1. For high-performance blockchains, the volume of historical data is even larger. When running at full speed, the data volume of the high-performance public chain Solana is close to 4 PG, which is completely beyond the storage range of ordinary nodes. The solution chosen by Solana is to store historical data on the decentralized storage network Arweave, and only retain 2 days of data on the main network node for verification. In order to ensure the security of the storage process, Solana and Arweave chain have designed a storage bridge protocol Solar Bridge. The data verified by the Solana node will be synchronized to Arweave and the corresponding tag will be returned. By just using this tag, the Solana node can view the historical data of the Solana blockchain at any time. On Arweave, it is not necessary for all nodes in the network to maintain data consistency and use this as a threshold for participating in the network operation. Instead, a reward storage method is adopted. First of all, Arweave does not use the traditional chain structure to build blocks, but is more like a graph structure. In Arweave, a new block will not only point to the previous block, but also randomly point to a generated block Recall Block. The specific location of the Recall Block is determined by the hash result of its previous block and its block height. 
Before the previous block is mined, the Recall Block's position is therefore unknown. To generate a new block, however, a node must hold the Recall Block's data in order to run the PoW computation for a hash of the specified difficulty, and only the first miner to find a qualifying hash receives the reward, which encourages miners to store as much historical data as possible. At the same time, the fewer nodes that store a given historical block, the fewer competitors a node faces when that block is selected as the Recall Block, which encourages miners to store the blocks that are rarely replicated in the network. Finally, to ensure that nodes store data permanently, Arweave introduces the Wildfire node scoring mechanism: nodes prefer to communicate with peers that can serve more historical data faster, so low-scoring nodes often cannot obtain the latest block and transaction data promptly and thus lose their edge in the PoW competition.
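The Recall Block incentive described above can be sketched in a few lines of Python. This is a deliberate simplification, not Arweave's actual implementation: we assume the recall index is derived by hashing the previous block's hash together with the current height, and that a miner may only attempt the PoW if it holds that block locally.

```python
import hashlib

def recall_block_index(prev_block_hash: bytes, height: int) -> int:
    # Hypothetical rule: hash(prev_hash || height) mod height selects an
    # already-generated block in [0, height) as the Recall Block. The index
    # is unknowable until the previous block (and its hash) exists.
    digest = hashlib.sha256(prev_block_hash + height.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % height

def can_attempt_pow(stored_blocks: set, prev_block_hash: bytes, height: int) -> bool:
    # A miner may only run the PoW for the next block if it holds the
    # Recall Block's data locally -- this is what rewards storing history.
    return recall_block_index(prev_block_hash, height) in stored_blocks
```

A miner storing the full history can always attempt the PoW, while one keeping only a handful of recent blocks will frequently be locked out of a round, which is exactly the storage incentive the mechanism creates.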

Arweave block construction method, image source: Arweave Yellow-Paper

5. Comprehensive comparison

Next, we will compare the advantages and disadvantages of the five storage solutions based on the four dimensions of DA performance indicators.

  • Security: The biggest sources of data security risk are loss during data transmission and malicious tampering by dishonest nodes. In cross-chain scenarios, the independence of the two public chains and their lack of shared state make data transmission especially vulnerable. In addition, a Layer1 that currently needs a dedicated DA layer usually has a strong consensus group, and its own security is far higher than that of an ordinary storage public chain, so the main chain DA solution offers higher security. Once transmission security is ensured, the next step is to secure data retrieval. Considering only the short-term historical data used to verify transactions, in a temporary storage network the same data is backed up by every node in the network, whereas in a DankSharding-like solution the average number of backups per piece of data is only 1/N of the total node count. Greater redundancy makes data less likely to be lost and provides more reference samples during verification, so temporary storage offers relatively higher data security. Among the third-party DA solutions, the main chain-dedicated DA shares public nodes with the main chain, so data can be transmitted directly through these relay nodes during the cross-chain process, giving it relatively higher security than the other third-party DA solutions.
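The redundancy argument above can be made concrete with a toy loss model (the numbers are illustrative, not from the source): if each holder of a block fails independently with probability p, the chance that the block disappears entirely is p raised to the number of copies.

```python
def copies_per_block(total_nodes: int, shards: int = 1) -> float:
    # Full replication (temporary storage): every node holds every block.
    # With data split into `shards` shards spread evenly, a given block is
    # held by only total_nodes / shards nodes on average.
    return total_nodes / shards

def loss_probability(copies: float, p_node_fail: float) -> float:
    # The block is lost only if every holder fails independently.
    return p_node_fail ** copies

full_replication = loss_probability(copies_per_block(1000), 0.1)
sharded = loss_probability(copies_per_block(1000, 64), 0.1)
```

Both probabilities are tiny, but full replication is astronomically safer than the ~1/64 redundancy of the sharded case, matching the claim that temporary storage offers relatively higher data security.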

  • Storage cost: The biggest factor affecting storage cost is the amount of data redundancy. In the main chain DA's short-term storage solution, data is synchronized across all network nodes, so every newly stored piece of data must be backed up on every node, which gives it the highest storage cost. This high cost in turn means that in a high-TPS network this method is suitable only for temporary storage. Next comes Sharding-based storage, both Sharding on the main chain and Sharding in third-party DA; since the main chain usually has more nodes, a given Block has more backups there, so the main chain Sharding solution costs more. The lowest storage cost belongs to the storage public chain DA with its reward-based storage scheme, under which the amount of data redundancy tends to fluctuate around a fixed constant. The storage public chain DA also introduces a dynamic adjustment mechanism: when the number of backups of some data falls, the reward for storing it is increased to attract more nodes and keep the data secure.
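The cost ranking in this paragraph follows directly from redundancy: total provisioned storage is roughly data size times the average number of copies. A back-of-the-envelope sketch, with all node counts, shard counts, and the reward-storage redundancy constant invented for illustration:

```python
GB = 1024 ** 3

def network_storage(data_bytes: int, avg_copies: float) -> float:
    # Total bytes the whole network must provision ~= data size x redundancy.
    return data_bytes * avg_copies

data = 100 * GB
temp_full    = network_storage(data, 1000)       # full replication on 1000 nodes
main_shard   = network_storage(data, 1000 / 32)  # main chain sharding, 32 shards
third_shard  = network_storage(data, 500 / 32)   # third-party chain with fewer nodes
reward_store = network_storage(data, 5)          # reward storage: ~constant redundancy
```

The ordering reproduces the text's ranking: temporary full replication is most expensive, main chain Sharding costs more than third-party Sharding because of the larger node count, and reward-based storage with near-constant redundancy is cheapest.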

  • Data reading speed: Reading speed is mainly affected by where the data sits within the storage medium, the data's index path, and how the data is distributed across nodes. Of these, the storage medium on the node has the greatest impact, since keeping data in memory versus on SSD can produce a read-speed difference of dozens of times. Storage public chain DAs mostly use SSD storage, because the on-chain load includes not only DA-layer data but also space-hungry personal data such as videos and pictures uploaded by users; without SSDs, the network could hardly bear the huge storage pressure or meet long-term storage needs. Comparing third-party DA with a main chain DA that keeps data in memory, the third-party DA must first look up the corresponding index data on the main chain, transfer the index across chains to the third-party DA, and return the data through the storage bridge, whereas the main chain DA can query data directly from its own nodes and therefore retrieves data faster. Finally, within the main chain DA, the Sharding method must fetch Blocks from multiple nodes and reassemble the original data, so it is slower than the unsharded short-term storage method.
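The retrieval paths just described can be compared with a toy latency model; every per-hop number below is invented purely for illustration.

```python
def read_latency_ms(hops: list) -> float:
    # End-to-end read latency is roughly the sum of per-hop latencies.
    return sum(hops)

# Hypothetical per-hop costs in milliseconds.
main_chain_direct  = read_latency_ms([5])          # single local node lookup
main_chain_sharded = read_latency_ms([5, 30, 10])  # locate shards + fetch + reassemble
third_party_da     = read_latency_ms([5, 80, 40])  # main chain index + cross-chain hop + bridge return
```

The model mirrors the text: direct main chain queries are fastest, Sharding adds shard-collection overhead, and third-party DA pays for the extra index lookup and cross-chain round trip.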

  • DA layer universality: The universality of the main chain DA is close to zero, because it is impractical to transfer data from one public chain that lacks storage space to another chain that also lacks it. In third-party DA, a solution's universality and its compatibility with a specific main chain are a pair of contradictory indicators. For example, a main chain-dedicated DA designed for a particular main chain makes many adaptations at the node-type and network-consensus level to fit that chain, and these very adaptations become major obstacles when communicating with other public chains. Within third-party DA, the storage public chain DA outperforms the modular DA in universality: it has a larger developer community and more supporting infrastructure, so it can adapt to the circumstances of different public chains. Moreover, the storage public chain DA acquires data by actively fetching it rather than passively receiving information pushed from other public chains, so it can encode data in its own way, standardize the storage of data streams, conveniently manage data from different main chains, and improve storage efficiency.

Storage solution performance comparison. Image source: Kernel Ventures

6. Conclusion

At this stage, blockchain is transitioning from Crypto to a more inclusive Web3, a shift that brings far more than a richer set of on-chain projects. To accommodate so many projects on Layer1 while preserving the experience of GameFi and SocialFi projects, Layer1s represented by Ethereum have adopted methods such as Rollups and Blobs to raise TPS, and among emerging blockchains the number of high-performance chains keeps growing. But higher TPS means not only higher performance; it also means greater storage pressure on the network. To cope with massive historical data, both main chain DA and various third party-based DA methods have been proposed to match the growth of on-chain storage pressure. Each improvement has its own advantages and disadvantages and suits different scenarios.

Payment-based blockchains have extremely high requirements for the security of historical data and do not pursue particularly high TPS. If such a public chain is still in the preparation stage, a DankSharding-like storage method can greatly increase storage capacity while maintaining security. For an established public chain like Bitcoin with a large number of nodes, however, rash changes at the consensus layer carry huge risk, so a main chain-dedicated DA, which offers higher security among off-chain storage options, can balance the security and storage problems. It is worth noting, though, that a blockchain's functions are not static but constantly evolving. Early Ethereum, for example, was mainly limited to payments and simple automated handling of assets and transactions via smart contracts, but as the blockchain landscape expanded, various SocialFi and DeFi projects joined Ethereum and pushed it in a more comprehensive direction. Recently, with the explosion of the inscription ecosystem on Bitcoin, transaction fees on the Bitcoin network have surged nearly 20 times since August: the network's transaction throughput can no longer meet demand, so traders must raise fees to get transactions processed promptly. The Bitcoin community now faces a trade-off: accept high fees and slow transactions, or increase transaction speed at the cost of reduced network security, violating the original intent of the payment system. If the community chooses the latter, then as data pressure grows, the storage solution will need to be adjusted accordingly.

Bitcoin mainnet transaction fee fluctuations, image source: OKLINK

As for public chains with comprehensive functions, their pursuit of TPS is higher and the growth of historical data is even greater, so in the long run a DankSharding-like solution can hardly keep pace with rapid TPS growth. A more appropriate approach is to migrate the data to a third-party DA for storage. Among these, the main chain-dedicated DA has the highest compatibility and may hold the advantage if only a single public chain's storage problem is considered. But with Layer1 public chains flourishing today, cross-chain asset transfer and data interaction have become a shared goal of the blockchain community. Taking the long-term development of the whole blockchain ecosystem into account, storing the historical data of different public chains on the same chain eliminates many security issues in data exchange and verification, so modular DA and storage public chain DA may be the better choices. With similar universality, modular DA focuses on serving the blockchain DA layer and introduces finer-grained index management of historical data, allowing it to classify different public chains' data sensibly, which gives it an edge over the storage public chain. However, the above solutions do not account for the cost of adjusting the consensus layer on an existing public chain, a process that is extremely risky: a single failure may create systemic vulnerabilities and cost the chain its community consensus. Therefore, as a transitional solution during blockchain scaling, the simplest option, temporary storage on the main chain, may be more appropriate. Finally, the discussion above is based on performance in actual operation; if a public chain's goal is to grow its own ecosystem and attract more projects and participants, it may also favor projects supported and funded by its own foundation. For example, even if the overall performance were equal to or slightly below that of a storage public chain solution, the Ethereum community would still tend to favor Layer2 projects backed by the Ethereum Foundation, such as EthStorage, in order to keep developing the Ethereum ecosystem.

In summary, today's blockchains serve increasingly complex functions, which brings greater storage demands. With enough Layer1 validation nodes, historical data does not need to be backed up by every node in the network; a certain number of backups is enough to guarantee relative security. Meanwhile, the division of labor among public chains has grown finer: Layer1 handles consensus and execution, Rollups handle computation and verification, and separate blockchains handle data storage, letting each part focus on one function without being constrained by the others' performance. Yet how many nodes, or what proportion of them, should store historical data to balance security and efficiency, and how to ensure secure interoperability between different blockchains, remain problems for blockchain developers to think through and keep improving. For investors, main chain-dedicated DA projects on Ethereum deserve attention, because Ethereum already has enough supporters at this stage and does not need other communities to expand its influence; what it needs more is to refine and grow its own community and attract more projects to land on the Ethereum ecosystem. Follower public chains such as Solana and Aptos, by contrast, lack such a complete single-chain ecosystem, so they may prefer to join forces with other communities to build a large cross-chain ecosystem and expand their influence. For emerging Layer1s, therefore, general-purpose third-party DA deserves more attention.

Kernel Ventures is a crypto venture capital fund driven by the research and development community, with more than 70 early-stage investments, focusing on infrastructure, middleware, dApps, especially ZK, Rollup, DEX, modular blockchains, and verticals that will carry billions of future crypto users, such as account abstraction, data availability, scalability, etc. Over the past seven years, we have been committed to supporting the development of core development communities and university blockchain associations around the world.

References

  1. Celestia: The starry sea of modular blockchain: https://foresightnews.pro/article/detail/15497

  2. DHT usage and future work: https://github.com/celestiaorg/celestia-node/issues/11

  3. Celestia-core: https://github.com/celestiaorg/celestia-core

  4. Solana Labs: https://github.com/solana-labs/solana

  5. Announcing The SOLAR Bridge: https://medium.com/solana-labs/announcing-the-solar-bridge-c90718a49fa2

  6. leveldb-handbook: https://leveldb-handbook.readthedocs.io/zh/latest/sstable.html

  7. Kuszmaul J. Verkle Trees, 2019: https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf

  8. Arweave official website: https://www.arweave.org/

  9. Arweave Yellow Paper: https://www.arweave.org/yellow-paper.pdf