In this three-part series, we will reveal the technological achievements that can significantly improve data processing for applications running on the Internet Computer Protocol (ICP).
This upgrade is part of Stellarator, a milestone on the ICP roadmap that is currently being rolled out across the entire network. Stellarator is a breakthrough in on-chain data storage, enabling each subnet to host over 1TB of memory and opening up opportunities for data-rich applications that were previously constrained by storage limits.
This advancement enables developers to build complex applications that require large-scale data processing, bringing blockchain technology to new levels of practicality.
Without further ado, let's kick off the series by looking at how ICP stores data with the Stellarator update in place.
Data persistence on the Internet Computer
This blog post outlines how replicas on the Internet Computer persist data, focusing in particular on the recent changes introduced by the Log-Structured Merge Tree (LSMT) based storage layer, which allows Internet Computer subnets to hold more replicated storage and to handle heavy workloads better.
The Internet Computer consists of subnets: virtual machines that are replicated identically across 13-40 replica machines. Each replica is responsible for executing all messages sent to its subnet and for storing all canister data, so every replica holds the complete and identical state of the subnet.
Developers can deploy canisters on the Internet Computer. Canisters are similar to smart contracts on other blockchains, but they can perform more general computations and store significantly more data than smart contracts on other chains.
The data saved in canisters eventually needs to be stored on physical hardware. Thanks to the recently introduced LSMT-based storage layer, along with many other optimizations and improvements, subnets on the Internet Computer can store up to 1TB of canister data.
Most of a canister's data is stored either in its heap memory (up to 4GB at the time of writing) or in its stable memory (up to 500GB at the time of writing). Other forms of data associated with a canister include its code, in-flight messages, and various metadata such as the controller list and the cycles balance.
The storage layer of ICP bridges the gap between canister storage (such as heap and stable memory) and the underlying storage hardware of the replica machines (such as disks and RAM).
As part of the Stellarator milestone, the storage layer underwent extensive redesign and reimplementation to enable the ICP to address future scalability challenges and resolve the most critical scalability bottlenecks of the old storage layer. All relevant changes have recently been completed, and the Internet Computer has been running the new storage layer implementation for several months.
The remainder of this blog post discusses the redesign of the storage layer around a Log-Structured Merge Tree data structure, aimed at eliminating storage-related bottlenecks and providing a better experience for users and developers of storage-intensive canisters.
For ICP users, the most notable result of this work is that the replicated state of a single subnet was recently increased to 1TB. Moreover, the redesign enables the Internet Computer to better handle canisters that write large amounts of data.
Checkpoints
In general, the storage layer of the Internet Computer combines persistent storage on disk with temporary storage in RAM. A key concept in how the Internet Computer stores its state is the so-called checkpoint, which is a logical point in time when the entire state of the subnet is stored on disk.
Checkpoints are created deterministically every 500 blocks, i.e., roughly every few minutes, meaning all replicas write the same checkpoint at the same height. Checkpoints are stored as file directories on the disk of each replica node; in simplified form, the directory structure looks as follows:
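A simplified, illustrative layout is shown below. The file and directory names are indicative only, based on the description that follows, and do not match the replica's actual naming:

```
<checkpoint height>/
  canisters/
    <canister id A>/
      heap_memory.bin       # heap memory
      stable_memory.bin     # stable memory
      queues.bin            # in-flight messages and other per-canister data
    <canister id B>/
      ...
```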
In this structure, each canister is stored in its own subdirectory, and each canister directory contains separate files for heap memory, stable memory, and other information (such as in-flight messages). There are several reasons why data is saved to disk in the form of checkpoints.
1. Data persistence: Replica machines may restart at any time, there may be software bugs in the replica code, hardware may fail, or there may be power issues in the data center. In such cases, the latest checkpoint can be reloaded.
Note that even though checkpoints are created only every 500 rounds, replicas can recreate the state for non-checkpoint heights. They only need the latest checkpoint and all finalized blocks between that checkpoint and the latest state. Since all execution is deterministic, these blocks can be replayed and the recreated state is guaranteed to be identical. The necessary blocks can either be kept on disk alongside the checkpoint or fetched from other replicas.
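As a minimal sketch of this recovery path, in which the generic types and the `exec` function stand in for the replica's real state, block, and execution types:

```rust
// Hypothetical sketch of recovery at a non-checkpoint height: load the latest
// checkpoint and deterministically replay the finalized blocks above it.
// `S` (subnet state), `B` (finalized block) and `exec` (deterministic block
// execution) are stand-ins, not the replica's actual API.
fn recover_state<S, B>(checkpoint_state: S, blocks: &[B], exec: impl Fn(S, &B) -> S) -> S {
    // Because execution is deterministic, every replica replaying the same
    // blocks from the same checkpoint arrives at exactly the same state.
    blocks.iter().fold(checkpoint_state, |state, block| exec(state, block))
}
```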
2. Synchronization: All (replicated) messages are executed by all replicas, so for any given height h all replicas should have the same state. The protocol guards against divergence (i.e., some honest replicas ending up in a state different from the consensus state) by hashing the state and then threshold-signing the resulting hash. A threshold signature can only be created when at least ⅔ of the replicas (more precisely, 2f + 1 out of n = 3f + 1, where f is the maximum number of faulty replicas tolerated) agree on the same hash, allowing the subnet to continue operating.
However, subnets on the Internet Computer can hold large states, up to 1TB at the time of writing. It is not feasible to hash that much data while sustaining a rate of up to 2.5 blocks per second (which the fastest subnets on the Internet Computer currently achieve). Therefore, at non-checkpoint heights the Internet Computer hashes only selected portions of the state, such as basic information about each canister, recent responses to incoming messages, and XNet messages sent to other subnets.
For checkpoints, the protocol hashes the entire state into a data structure called the manifest. The manifest is computed by hashing all files in the checkpoint directory and lists the hashes of all individual files, chunked into 1MB blocks. At the end of the manifest computation, a root hash covering all individual hashes in the manifest is computed and then threshold-signed by the subnet. Computing the manifest may take several seconds, but this work is done only once every 500 blocks and runs in the background, in parallel with canister execution.
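The following sketch illustrates the core idea behind the manifest, hashing each file in 1MB chunks and then hashing all chunk hashes into a single root hash. It is a simplification under assumed names (`chunk_hashes`, `root_hash`); the real manifest also covers file paths, sizes, and offsets:

```rust
use sha2::{Digest, Sha256};
use std::{fs, io::{self, Read}, path::Path};

const CHUNK_SIZE: usize = 1 << 20; // 1 MiB chunks

// Hash one checkpoint file chunk by chunk.
fn chunk_hashes(path: &Path) -> io::Result<Vec<[u8; 32]>> {
    let mut file = fs::File::open(path)?;
    let mut hashes = Vec::new();
    loop {
        // Read the next chunk of up to 1 MiB.
        let mut chunk = Vec::with_capacity(CHUNK_SIZE);
        let read = (&mut file).take(CHUNK_SIZE as u64).read_to_end(&mut chunk)?;
        if read == 0 {
            break; // end of file
        }
        hashes.push(Sha256::digest(&chunk).into());
    }
    Ok(hashes)
}

// Combine all chunk hashes of all files into a single root hash,
// which is what the subnet then threshold-signs.
fn root_hash(all_chunk_hashes: &[[u8; 32]]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    for h in all_chunk_hashes {
        hasher.update(h);
    }
    hasher.finalize().into()
}
```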
3. State synchronization: The Internet Computer allows the topology of a subnet to change via NNS proposals. When a replica node joins a subnet, it can fetch the latest checkpoint from other replica nodes. Recall that a checkpoint is a collection of files; after verifying the root hash against the subnet's threshold signature, the state sync protocol can fetch the files chunk by chunk, comparing the hash of each fetched chunk with the corresponding hash in the manifest. If all checks succeed, the replica can conclude that the fetched state corresponds to the agreed-upon state of the subnet at the checkpoint height. State sync is also triggered when a replica falls so far behind the healthy replicas that it cannot catch up via pure block replay.
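The per-chunk check during state sync boils down to something like the following, a simplified sketch with an assumed function name:

```rust
use sha2::{Digest, Sha256};

// A fetched chunk is accepted only if its hash matches the corresponding
// manifest entry; the manifest's root hash was already verified against
// the subnet's threshold signature.
fn chunk_matches_manifest(fetched_chunk: &[u8], expected_hash: &[u8; 32]) -> bool {
    Sha256::digest(fetched_chunk).as_slice() == expected_hash
}
```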
4. Maximum state size: Currently, the maximum state size is 1TB, while the RAM of replica node machines is 512GB. Therefore, it is impossible to load the entire state into RAM. The Internet Computer primarily uses RAM to hold the latest data that has not yet been persisted and to cache data for performance improvements.
PageMap and non-checkpoint heights
Since checkpoints are only created every 500 blocks, the ICP needs a different way to store the state at non-checkpoint heights. The basic idea followed by the storage layer is that this data is kept on disk in the form of the last checkpoint, while all changes made since then are kept in RAM.
Most of the subnet state is represented by a data structure called PageMap. The vast majority of the subnet state is canister state, in particular the heap and stable memory of canisters.
Currently, canisters can have up to 4GB of heap memory and 500GB of stable memory, and the total subnet state is limited to 1TB; all of these limits may change in the future. Both types of memory can be read via replicated (update) and non-replicated (query) canister calls, and can be modified via replicated calls.
The PageMap data structure is designed for efficient reads and writes of memory, as well as for efficient writing of checkpoints. A specific goal is to make performance independent of the total size of the memory. Note that the name PageMap comes from the fact that all reads and writes happen at the granularity of 4KB pages, the same page size used by the underlying operating system.
PageMap stores state in two layers: the first layer, called storage, consists of files from the previous checkpoint, representing the state at the previous checkpoint height, while the second layer, the page deltas, represents all changes since that checkpoint and is stored in RAM.
When reading from a PageMap, the returned data is fetched from the page delta if present, or read from the checkpoint file otherwise. Writing to a PageMap is done by recording the new data in the page delta.
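A minimal sketch of this two-layer idea is shown below, using simplified, hypothetical types; in the real implementation the storage layer is backed by checkpoint files on disk rather than an in-memory map:

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096; // 4KB, matching the OS page size
type Page = [u8; PAGE_SIZE];

struct PageMap {
    /// Pages as they were at the last checkpoint (stand-in for the checkpoint files).
    storage: HashMap<u64, Page>,
    /// Pages modified since the last checkpoint, kept in RAM.
    page_delta: HashMap<u64, Page>,
}

impl PageMap {
    /// Reads consult the page delta first and fall back to checkpoint storage;
    /// a page present in neither reads as all zeros.
    fn get_page(&self, index: u64) -> Page {
        self.page_delta
            .get(&index)
            .or_else(|| self.storage.get(&index))
            .copied()
            .unwrap_or([0u8; PAGE_SIZE])
    }

    /// Writes only touch the in-RAM page delta; checkpoint files stay untouched.
    fn update_page(&mut self, index: u64, page: Page) {
        self.page_delta.insert(index, page);
    }
}
```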
Checkpoint lifecycle
The main task of the storage layer is to write checkpoints to disk and keep all PageMaps up to date. When writing a new checkpoint, all page deltas are flushed to disk, thus updating the storage portion of all PageMaps.
To ensure that all data is saved, replicas need to always retain the latest checkpoint that has been threshold signed. This means that it is not possible to simply overwrite old checkpoint files since any such modification would change the previous checkpoint before the new checkpoint is fully written, risking data loss. Furthermore, the cost of writing a full checkpoint to disk (up to 1TB) every 500 rounds would be exceedingly high. Instead, creating new checkpoints includes three basic steps:
Copy the old checkpoint to a temporary folder, called a tip;
Modify the tip to represent the data of the new checkpoint;
Rename the tip to the new checkpoint directory (to create a new checkpoint atomically).
The first step can be the most expensive one: the larger the state, the larger the checkpoint files, and the longer it takes to copy them.
However, this is precisely where the LSMT-based file format of checkpoints comes into play. With LSMT, this step is cheap and does not scale with the size of the state. In contrast, before the LSMT redesign of the storage layer, this step was slow and unpredictable.
Log-Structured Merge Tree
Log-Structured Merge Trees (LSMT) are a widely used data structure, especially suitable for databases. On the Internet Computer, they serve as the basis for the storage portion of PageMaps, addressing a specific problem: simplifying the 'copy' step of the checkpoint lifecycle, as all files are written only once and never modified.
Using LSMT, the (logical) state is modified by writing additional overlay files. To read the value of a page, the algorithm first checks the most recently written overlay file to see if that file contains the page. If it does, the page's value is read from that overlay file; otherwise, the next older overlay file is checked. In the implementation used on the Internet Computer, if a page does not exist in any overlay file, it is read as all zeros.
The diagram below shows a set of three overlay files representing a canister's state, with vertical arrows indicating reads of different data and the files from which the data is ultimately read.
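A minimal sketch of this read path, using in-memory stand-ins for the on-disk overlay files rather than the replica's real types:

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];
/// One overlay file, modeled as the set of pages it contains, keyed by page index.
type Overlay = HashMap<u64, Page>;

/// `overlays` is ordered oldest-first, i.e., in the order the files were written.
fn read_page(overlays: &[Overlay], index: u64) -> Page {
    // Search from the most recently written overlay backwards.
    for overlay in overlays.iter().rev() {
        if let Some(page) = overlay.get(&index) {
            return *page; // found in the newest overlay containing this page
        }
    }
    // A page that appears in no overlay reads as all zeros.
    [0u8; PAGE_SIZE]
}
```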
The lifecycle of LSMT checkpoints is as follows:
Hard link all files from the previous checkpoint to a temporary folder, called a tip;
Write a new overlay file containing all changes since the last checkpoint;
Rename the tip to the new checkpoint directory (for atomicity).
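The sketch below illustrates these three steps with ordinary filesystem calls. The paths, file names, and overlay contents are placeholders, and the real checkpoint is a directory tree rather than the flat directory assumed here:

```rust
use std::{fs, io, path::Path};

fn create_checkpoint(prev: &Path, tip: &Path, new: &Path) -> io::Result<()> {
    // 1. Hard-link every file of the previous checkpoint into the tip.
    //    This is cheap and independent of the state size, and is safe because
    //    the shared files are never modified afterwards.
    fs::create_dir_all(tip)?;
    for entry in fs::read_dir(prev)? {
        let entry = entry?;
        fs::hard_link(entry.path(), tip.join(entry.file_name()))?;
    }

    // 2. Write a new overlay file with the changes accumulated since the last
    //    checkpoint (placeholder name and contents here).
    fs::write(
        tip.join("heap_memory_1.overlay"),
        b"pages changed since the last checkpoint",
    )?;

    // 3. Atomically rename the tip into place as the new checkpoint directory.
    fs::rename(tip, new)
}
```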
The key point is that each overlay file is written only once and never modified, which makes it safe to keep multiple checkpoints on disk and to share some files between them via hard links. Note that this same process would not work if the checkpoint lifecycle had to modify any file: when a file has multiple hard links, modifying it through any one of them changes the data seen through all of them, which would tamper with previously certified checkpoints.
Log-Structured Merge Trees can store multiple versions of the same range of data, leading to storage overhead, defined as the size of all overlay files relative to the logical size of the data represented. Additionally, data may be spread across multiple files.
As the storage overhead or the number of overlay files grows, the storage layer implementation schedules merges. A merge reorganizes the data by taking a set of overlay files and replacing them with a single file that contains only the latest version of each piece of data. Merges are scheduled for PageMaps with particularly high storage overhead or file counts, and they are executed in the background to avoid interfering with canister message execution.
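A sketch of a merge, again using in-memory stand-ins for overlay files: a set of overlays is collapsed into a single one that keeps only the latest version of every page, reducing both storage overhead and the number of files a read has to search:

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];
type Overlay = HashMap<u64, Page>;

/// `overlays` is ordered oldest-first; pages from newer overlays win.
fn merge(overlays: &[Overlay]) -> Overlay {
    let mut merged = Overlay::new();
    for overlay in overlays {
        for (&index, page) in overlay {
            // Later (newer) overlays overwrite pages from earlier ones.
            merged.insert(index, *page);
        }
    }
    merged
}
```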
Previous design using Reflinks
The original storage layer of the Internet Computer did not rely on LSMT. From the inception of the Internet Computer in 2021 until 2024, the storage layer relied heavily on reflinking.
Reflinking, sometimes referred to as copy-on-write, is a file system operation for duplicating files. Some file systems support it, and the replicas of the Internet Computer use the XFS file system, which does. Reflinking differs from a regular file copy in that it does not duplicate the entire file content; instead, the file system remembers which data is shared between the original file and the new one. It also differs from hard linking in that both files can be modified independently of each other.
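For illustration, the sketch below contrasts the two operations on Linux/XFS. It is not the replica's code; `cp --reflink=always` is simply one way to request a reflink on file systems that support it:

```rust
use std::{fs, io, path::Path, process::Command};

fn hard_link_copy(original: &Path, link: &Path) -> io::Result<()> {
    // Both names refer to the same data: a write through one is visible via the other.
    fs::hard_link(original, link)
}

fn reflink_copy(original: &Path, copy: &Path) -> io::Result<()> {
    // The new file initially shares data blocks with the original (copy-on-write),
    // but the two files can afterwards be modified independently.
    let status = Command::new("cp")
        .arg("--reflink=always")
        .arg(original)
        .arg(copy)
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::Other, "reflink copy failed"))
    }
}
```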
In the old storage layer design, the lifecycle of checkpoints worked as follows:
Reflink all files from the previous checkpoint to a temporary folder, called a tip;
Modify the files in the tip based on all changes since the last checkpoint;
Rename the tip to the new checkpoint directory.
One potential advantage is that each PageMap is represented by a single file in the checkpoint, avoiding storage overhead. However, modifying that file for the new checkpoint height requires reflinking the relevant files of the previous checkpoint rather than hard linking them.
Advantages of LSMT over Reflinks
In principle, reflinking promises the speed of hard linking together with the flexibility of independent copies. Unfortunately, as the data demands on the Internet Computer continued to grow (both in I/O throughput and in total data volume), reflinking became a performance bottleneck for several reasons.
1. Slow reflinking: The time required to reflink files varies widely. In some cases it takes only a few seconds, but in experiments we also observed reflinking of 370GB taking up to 10 hours. In the old checkpoint logic of the Internet Computer, a 10-hour reflinking step would have halted the entire subnet for 10 hours.
Poor reflinking speed is caused by fragmentation. Internally, the XFS file system maintains data structures that map the various parts of a file to the actual data blocks on disk. When traversing these data structures becomes expensive, the file is fragmented and reflinking becomes slow.
Fragmentation is especially likely to be triggered by the following sequence: writing a large amount of data to a file, reflinking it, writing a large amount of data to one of the copies, reflinking again, and so on. Unfortunately, given the checkpoint workflow on the Internet Computer, heavily used canisters trigger exactly this behavior.
On the other hand, hard links have consistent performance, unaffected by any fragmentation issues.
Before the introduction of the Log-Structured Merge Tree, a number of stopgap measures were put in place, including manually defragmenting files (by reading them and writing them back) and writing files in larger contiguous sections. Both came at the cost of higher write amplification and could still lead to pauses of up to 30 minutes.
2. Maximum state size: Aside from the extremely long reflinking times caused by excessive fragmentation, even moderate fragmentation makes the reflinking time proportional to the total amount of data stored in the subnet. In the previous implementation of the storage layer, the Internet Computer limited each subnet's storage to 700GB, a figure largely determined by how much moderately fragmented data could be reflinked within 20-30 seconds.
When using hard links, the checkpoint time does not scale with data size in the same way, thus eliminating this bottleneck.
3. Poor multithreading performance: One of the goals of the checkpoint logic is to avoid synchronous operations whenever possible, since canister execution pauses during checkpointing. A natural question is whether reflinking could happen in the background while execution continues (even if slowly). Unfortunately, experience shows that it is not possible to reflink and read files at the same time, so reflinking in parallel with execution only slows execution down.
In the new storage layer design, hard linking occurs in parallel with execution, and they do not slow each other down.
4. Caching: As mentioned above, reading data from canister memory involves reading data both from RAM and from the underlying checkpoint files. Repeatedly reading the same file usually does not require repeated reads from the actual hard disk or SSD; instead, the data is cached by the operating system. Unfortunately, reflinking interferes with this caching, because reading from the new copy after a reflink does not use the cache. As a result, a large (and slow) spike in disk reads could be observed on the Internet Computer after a checkpoint was written, as all reads switched to the new files. The manifest computation (i.e., hashing all checkpoint files to reach consensus) also benefits greatly from better caching.
In contrast, when hard linking files, all caches are preserved, meaning any recently read or written data, even if it occurred during previous checkpoint intervals, will still be cached for quick retrieval.
Results
The LSMT storage layer was rolled out to all subnets of the Internet Computer over the course of a few weeks in the second quarter of 2024. The diagrams below show metrics before and after the replica code upgrade, with the red vertical line marking the point at which the LSMT storage layer was enabled.
The first diagram shows the checkpoint times of the w4rem subnet, which hosts the canisters for the Bitcoin integration. Compared to many other subnets, the canisters hosted on the Bitcoin subnet have write-heavy workloads on both their heap and stable memory, making fragmentation a particular concern for this subnet.
The metrics show that checkpoint times dropped from over 20 seconds to just 1-2 seconds, mainly due to the elimination of the reflinking step, which accounted for most of those 20 seconds.
For users of the Bitcoin canister, the benefit is faster response times. During a checkpoint, the subnet does not process any update calls or replicated queries. If a user sent an update call or replicated query to the Bitcoin canister just as a checkpoint started, it would take at least 20 seconds to receive a response. The LSMT storage layer essentially eliminates such inconsistent response times.
The second diagram shows the finalization rate on the k44fs subnet, where the finalization rate, or block rate, is the number of blocks produced by the subnet per second.
The Internet Computer limits the number of instructions executed per round to roughly correspond to the amount of work it can complete in one second, ensuring that the finalization rate can remain above one block per second.
Before the upgrade to the LSMT storage layer, the finalization rate regularly dipped, and these dips coincided exactly with checkpoints. Checkpoints affect the finalization rate for two main reasons: the first is the time needed to create the checkpoint, during which no blocks are executed. After the upgrade, this impact is reduced because checkpoint times are now much shorter.
The second reason is the better caching behavior of the LSMT storage layer. Specifically, the reflinking step in the old storage layer implementation invalidated the cache. As a result, after a checkpoint, any canister reading from memory caused the replica to fetch that data from disk, which is several orders of magnitude slower than reading the same data from the RAM cache. The new LSMT storage layer does not have this issue.
As the metrics show, the dips in the finalization rate after the upgrade are significantly smaller, because checkpoints themselves are faster, reflinking is no longer necessary, and the file cache is no longer invalidated.
For users of the canisters on this subnet, this means faster response times and higher throughput.
Conclusion
Redesigning the storage layer of the Internet Computer around a Log-Structured Merge Tree data structure is an important investment in the scalability and performance of the platform. It not only makes some memory-intensive workloads feasible, but also allows the Internet Computer to offer canisters larger states.
Canisters that operate on large datasets are particularly interesting in the context of artificial intelligence and running large language models on-chain. Such canisters not only rely on large amounts of data storage but also depend heavily on I/O throughput.
Additionally, it lays the groundwork for subsequent improvements, such as taking heavy workloads off the critical path by better exploiting concurrency, making the Internet Computer faster. Experience with the original storage layer shows that avoiding reflinking is crucial to achieving this, and the LSMT data structure does exactly that.
Did you enjoy this article? Share your thoughts on the DFINITY Developers X channel, and join us tomorrow for Part 2 of our Stellarator journey, where we will explore enhanced orthogonal persistence with Luc Bläser.