Shardeum Launch: A Groundbreaking Achievement
On January 27, 2024, a groundbreaking event occurred in the world of web3 that can be likened to the historic feat of a spacecraft returning perfectly to its launch pad after its first flight test mission. In this extraordinary scenario, Shardeum not only faced tough challenges but emerged victorious, its network resilience marking the first time a sharded network had self-healed in the realm of distributed ledger technology.
Just as a spacecraft's journey involves meticulous planning, precision engineering, and seamless execution of complex maneuvers, the restoration of Shardeum's Sphinx betanet, which had suffered a critical crash, required an equal level of technological mastery and innovation. The ability to persist all data on a network, especially one that operates using dynamic state sharding, is groundbreaking.
As we embark on this exploration, we not only celebrate Shardeum's landmark landing, but also recognize it as a watershed moment in the evolution of web3 technology — a leap that could redefine the boundaries of IT network resilience and data integrity.
The First Shard Network to Recover and Retain Data Independently
Dynamically maintaining and restoring a sharded network such as Shardeum involves a spectrum of complex technical challenges that differentiate it from traditional blockchain networks such as Bitcoin or Ethereum. In a dynamically sharded environment with autoscaling, continuous reallocation and balancing of nodes and resources across different shards is critical to optimizing performance and scalability. These constant changes in network architecture add significant complexity to maintaining data consistency, ensuring network stability, and facilitating effective failure recovery.
The importance of this challenge is clear when comparing Shardeum's response to node fluctuations with Bitcoin's. The Bitcoin network maintains functionality even with a small number of nodes, because each active node holds the complete state and transaction history. In contrast, because Shardeum is sharded, no active node holds the complete state and transaction history; each validator holds only a portion of the overall state. The consequence of this sharding is that all validator nodes become very lightweight, which creates a multitude of engineering opportunities and challenges. If a node goes down, how do we ensure all data is maintained? Shardeum addresses this in two main ways.
First, Shardeum uses dynamic state sharding, where the entire address space is partitioned (or divided) according to the number of active nodes. Each node is responsible for its assigned partition, along with a certain radius (R) around it and additional partitions (E) adjacent to it, ensuring dynamic adaptability and strong data redundancy within the network framework. So, even if a node goes down, there is still network continuity and no data is lost.
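To make the idea concrete, here is a minimal TypeScript sketch of partition coverage, assuming a simple ring layout in which the number of partitions equals the number of active nodes. The function name, parameters, and the symmetric treatment of R and E are illustrative assumptions, not Shardeum's actual implementation.

```typescript
// Hypothetical illustration of how a node's assigned partition plus a
// redundancy radius might be computed. Simplified for illustration only.

/** Partitions covered by a node: its own partition, R neighbors on each
 *  side, and E additional adjacent partitions, wrapped around the ring. */
function coveredPartitions(
  nodeIndex: number,   // index of the node in the active-node list
  totalNodes: number,  // number of active nodes = number of partitions
  radius: number,      // R: neighboring partitions stored for redundancy
  extra: number        // E: additional adjacent partitions
): number[] {
  const covered = new Set<number>();
  const reach = radius + extra;
  for (let offset = -reach; offset <= reach; offset++) {
    // Wrap around the partition ring so indices stay in [0, totalNodes).
    covered.add(((nodeIndex + offset) % totalNodes + totalNodes) % totalNodes);
  }
  return [...covered].sort((a, b) => a - b);
}

// Example: with 10 active nodes, R = 1 and E = 1, node 0 also holds
// partitions 8, 9, 1, and 2, so its data survives if a neighbor drops out.
console.log(coveredPartitions(0, 10, 1, 1)); // [0, 1, 2, 8, 9]
```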
Second, Shardeum uses archive nodes to store the complete state of the entire network. This is achieved by active nodes streaming their partially stored state to the archivers for collection. Because of these two design choices and the optimizations they enable, recovery for such a network must be designed in new ways that still preserve beneficial features such as autoscaling and linear scaling.
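The sketch below illustrates this archiving idea under the same caveat: active nodes stream only the partitions they hold, and the archiver aggregates them into a full copy of the network state. The `StateChunk` and `Archiver` shapes are hypothetical simplifications, not Shardeum's actual code.

```typescript
// A simplified, hypothetical sketch of active nodes streaming partial state
// to an archiver, which aggregates it into a full copy of the network state.

interface StateChunk {
  partition: number;                  // which partition the data belongs to
  accounts: Record<string, unknown>;  // account data held by the sender
}

class Archiver {
  // Full network state, rebuilt from the partial chunks each node streams in.
  private fullState = new Map<number, Record<string, unknown>>();

  receive(chunk: StateChunk): void {
    const existing = this.fullState.get(chunk.partition) ?? {};
    this.fullState.set(chunk.partition, { ...existing, ...chunk.accounts });
  }

  // After a crash, the archived copy can be replayed to restore the network.
  snapshot(): Map<number, Record<string, unknown>> {
    return new Map(this.fullState);
  }
}

// Usage: two lightweight validators each stream only the partition they hold.
const archiver = new Archiver();
archiver.receive({ partition: 0, accounts: { "0xabc": { balance: 10 } } });
archiver.receive({ partition: 1, accounts: { "0xdef": { balance: 5 } } });
console.log(archiver.snapshot().size); // 2 partitions archived
```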
Understanding the Crash
Now that we understand the basics of dynamic state sharding and know that archiver nodes are somehow involved, let's break down some of the additional components in more depth. To understand the Shardeum betanet crash and recovery, we must first understand a little about the following:
Archiver nodes
Missing archiver detection
Network modes
Recovery mode
Understanding the basics of each of these is important before we dive into the bugs involved, so let's take a look!
Archiver Nodes: Interstellar Storage
In Shardeum, archiver nodes, also called archivers, stand as a very important category of nodes, tasked with storing the entire state and historical records of the network. Unlike active nodes, archivers do not participate in the consensus process; their main function is to comprehensively archive all network data, including transactions and receipts. The archivers' contribution is critical to upholding the integrity of the network and ensuring its operations run smoothly, thereby confirming Shardeum's status as a strong, complete, and reliable network. Because archivers are an integral part of the network, Shardeum must have protocols in place to detect unresponsive archivers (and validators).
Missing Archiver Detection: Alien Technology Unveiled
Shardeum has a lost node detection protocol that detects when an active node becomes inoperative — this applies only to active nodes. Shardeum also has a similar protocol for archivers, called missing archiver detection. Missing archiver detection is a special protocol designed to handle the rare scenario where one or more archivers become inoperative and are marked as missing. Since archiver nodes are critical to maintaining the integrity and accessibility of historical data in the network, it is essential that, should they become unresponsive or malfunction, these events are captured so their downstream effects can be mitigated. Although a missing archiver did not cause this particular crash, the interaction between the missing archiver detection protocol and a particular network mode did. Now let's see what network modes there are in Shardeum.
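As a rough illustration of the detection idea, the following sketch flags any archiver that has not responded within a timeout. The record shape, the heartbeat-style check, and the one-minute threshold are assumptions for illustration; the real protocol is more involved.

```typescript
// A hypothetical missing-archiver check: an archiver that has been silent
// longer than the timeout is marked as missing.

interface ArchiverRecord {
  id: string;
  lastSeen: number; // timestamp (ms) of the last successful response
}

function findMissingArchivers(
  archivers: ArchiverRecord[],
  now: number,
  timeoutMs: number
): string[] {
  return archivers
    .filter((a) => now - a.lastSeen > timeoutMs)
    .map((a) => a.id);
}

// Usage: archiver B has been silent for two minutes and gets flagged.
const missing = findMissingArchivers(
  [
    { id: "archiver-A", lastSeen: Date.now() - 5_000 },
    { id: "archiver-B", lastSeen: Date.now() - 120_000 },
  ],
  Date.now(),
  60_000
);
console.log(missing); // ["archiver-B"]
```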
Network Mode On Shardeum: No NASA Required
The flagship innovation in Shardeum, supported by the underlying Shardus protocol, is the network mode framework. These modes go beyond basic operational states, coordinating node activities, data synchronization methods, and transaction management systems. Such a configuration plays an important role in maintaining the operational integrity of the network, especially in scenarios involving the loss of nodes and data — which matters precisely because Shardeum is a sharded network.
On a simpler level, the best way to understand network modes in Shardeum is as a well-coded contingency plan that enables continuity of operations for the entire network — even in unlikely conditions such as network crashes or network-wide degradation. This pre-programmed operational resilience ensures that Shardeum will always stay alive — no matter what difficulties the network faces.
While understanding the bugs doesn't require understanding every aspect of the network mode framework, it's helpful to know the basics. The essence of the network mode framework is the incorporation of several distinct network phases: Forming, Processing, Safety, Recovery, Restart, Restore, and Shutdown. These modes are carefully crafted to address various network situations and emergencies. However, the mode we are concerned with in this article is recovery mode.
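A minimal sketch of the framework is below, modeling the modes listed above as an enumeration along with the one rule described in the next section. The identifiers are illustrative assumptions, not the actual Shardus code, and other modes may restrict processing in ways not modeled here.

```typescript
// Hypothetical enumeration of the network modes plus one behavioral rule.

enum NetworkMode {
  Forming = "forming",
  Processing = "processing",
  Safety = "safety",
  Recovery = "recovery",
  Restart = "restart",
  Restore = "restore",
  Shutdown = "shutdown",
}

/** Recovery mode pauses application transaction processing and application
 *  data synchronization (see the next section). */
function pausesAppProcessing(mode: NetworkMode): boolean {
  return mode === NetworkMode.Recovery;
}

console.log(pausesAppProcessing(NetworkMode.Recovery));   // true
console.log(pausesAppProcessing(NetworkMode.Processing)); // false
```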
Reverse Engineering Recovery Mode: Roswell Revisited
Recovery mode is one of the seven network modes mentioned above. It is initiated when the number of active nodes in the network falls to a predetermined critical threshold or below (currently configured at 75% of the desired node count). This threshold can be adjusted according to network requirements. In this mode, the network pauses application transaction processing and application data synchronization. This strategy is designed to facilitate network expansion by seamlessly cycling in standby (idle) nodes as part of node rotation, thereby returning the number of active nodes to the optimal level, ideally back to 100% of the target.
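As a simple sketch of the trigger condition, assuming a configurable 75% threshold measured against the desired node count (names are illustrative, not Shardeum's actual configuration keys):

```typescript
// Hypothetical recovery-mode trigger check.

const RECOVERY_THRESHOLD = 0.75; // adjustable according to network requirements

/** Recovery mode is initiated when active nodes fall to 75% of the desired
 *  network size or lower. */
function shouldEnterRecovery(activeNodes: number, desiredNodes: number): boolean {
  return activeNodes <= RECOVERY_THRESHOLD * desiredNodes;
}

console.log(shouldEnterRecovery(70, 100)); // true: at or below the threshold
console.log(shouldEnterRecovery(90, 100)); // false: network still healthy
```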
During recovery mode, Shardeum's network architecture allows gradual node additions, limited to 20% growth per cycle (each cycle is approximately 60 seconds). This controlled growth rate is critical to maintaining network stability and ensuring smooth integration of new nodes. A rapid increase in the number of nodes, such as a 50% spike, has the potential to destabilize the network and complicate the integration process.
Each newly added node requires network resources for synchronization and integration. By limiting additions to 20% per cycle, the network ensures that its infrastructure can adequately support the new nodes without strain. This approach not only maintains network stability but also minimizes the risk of data inconsistencies or errors during the synchronization process, thereby preserving the integrity of the cycle chain data.
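The controlled growth can be sketched as follows, assuming admissions are capped at 20% of the current active set per roughly 60-second cycle; the function and constant names are again illustrative.

```typescript
// Rough simulation of the 20%-per-cycle growth limit described above.

const MAX_GROWTH_PER_CYCLE = 0.2;

/** How many standby nodes may join in the next cycle. */
function allowedJoins(activeNodes: number, desiredNodes: number): number {
  const room = desiredNodes - activeNodes;                    // nodes still missing
  const cap = Math.floor(activeNodes * MAX_GROWTH_PER_CYCLE); // 20% growth limit
  return Math.max(0, Math.min(room, cap));
}

// Example: growing back from 70 active nodes to a target of 100.
let active = 70;
let cycles = 0;
while (active < 100) {
  active += allowedJoins(active, 100);
  cycles++;
}
console.log(`Back to target in ${cycles} cycles (~${cycles} minutes)`); // 2 cycles
```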
Root Causes of Crashes: Event Horizon
It's important to note that there were two different bugs: a Neon library bug, which caused validators to randomly crash, and a bug in the missing archiver detection protocol, which would not accept an empty validator list. Although it was the missing archiver detection bug that ultimately caused the betanet to crash, let's discuss the Neon library bug first.
In Sphinx version 1.9.1, we integrated an update to a library that uses Neon bindings to link Rust and TypeScript functions, since Shardeum is primarily built in TypeScript. Neon is known for its innovative, albeit experimental, approach that often pushes the boundaries of conventional software development practices. The integration aimed to improve interoperability between the two languages, enabling more efficient and direct communication within our software architecture. However, it introduced a bug that caused nodes to randomly drop out of the network.
Second, the root cause of the recent betanet crash was identified as a critical anomaly originating from the interaction between the two subsystems mentioned above: the missing archiver detection mechanism and the network recovery mode protocol.
The crash was triggered by the simultaneous activation of these two mechanisms, a scenario never encountered or tested before. Missing archiver detection ran while the network was in recovery mode, and because of a bug, the detection code could not accept an empty list of active nodes. This led to the network crash.
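To illustrate the class of defect (not the actual Shardeum code), here is a hypothetical routine that originally assumed a non-empty active-node list; the added guard is the kind of fix that lets it tolerate the empty list that recovery mode can produce.

```typescript
// A simplified, hypothetical stand-in for the defect described above: a step
// of missing archiver detection that selects an active node to act on a
// report, plus the guard that keeps it from failing on an empty list.

interface ActiveNode {
  id: string;
}

function pickReportingNode(activeNodes: ActiveNode[]): ActiveNode | null {
  // The fix: during recovery mode the active-node list can be empty, so
  // return null instead of letting downstream code fail on a missing entry.
  if (activeNodes.length === 0) {
    return null;
  }
  const index = Math.floor(Math.random() * activeNodes.length);
  return activeNodes[index];
}

console.log(pickReportingNode([])); // null rather than a crash
console.log(pickReportingNode([{ id: "node-1" }, { id: "node-2" }]));
```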
Recovery Chronicles: From Systemic Shock to Stellar Awakening
So what actually happened and when? A timeline of events surrounding the network crash and its resolution is as follows:
Vulnerability and Initial Upgrade: The network had a vulnerability in an npm (Neon) library, flagged during the 1.9.1 linting process. A fix was implemented to address the issue. However, the fix inadvertently introduced an exception that was not reproduced during local testing.
Intermittent Library Exceptions Causing Validator Outages: The Neon library experienced sporadic exceptions that caused periodic validator outages. Although the network design allows for resiliency by replacing these validators, simultaneous failures across multiple validators unfortunately triggered network recovery mode.
Triggering Network Recovery Mode: Once in recovery mode, the protocol must clean up and recreate the list of active nodes. A concurrent bug in the missing archiver detection system, which did not accommodate empty validator lists, was the primary cause of the network crash.
Network Resolution and Recovery: The crash was fixed and the network was successfully restored using the data stored in the archivers. This was the first time in history that a crashed Layer 1 sharded network was successfully recovered with all data on the network preserved intact. This had never been done on any network, let alone a network with dynamic state sharding. The achievement marks a successful “rocket landing” in terms of network recovery.
Fixes Completed: Initial fixes were implemented to address the library issues, but in an ongoing effort to improve network stability, version 1.9.5 was released. This update introduced a single but important bug fix that addressed another instance of the Neon binding crash, pinpointing and fixing the specific vulnerability without requiring a network-wide upgrade. Initially, users operating on version 1.9.4 had the flexibility to remain on that version or upgrade to 1.9.5, based on their assessment of network performance and stability preferences. However, it was ultimately decided that, to improve network stability and address the persistent issues related to the Neon bindings, the minimum version required for validators should be raised to 1.9.5. This update systematically excluded validators running version 1.9.4, which had been identified as vulnerable to crashes due to the aforementioned Neon binding complications, and was necessary to ensure that the Neon bug was completely removed and fixed.
Now that we know the timeline and how the major events unfolded, let's take a peek at what made it possible to restore the network so quickly.
Soaring towards Recovery
Agile Recovery
Network recovery consists of many parts, but one of the main ones is Shardeum's recovery mode. As previously stated, recovery mode is initiated when the number of active nodes falls below a predetermined critical threshold, and it allows rapid, controlled, and effective network growth to safely restore the network. It is important to emphasize that without the technological ingenuity of the network mode designers and developers, Shardeum would not have been able to recover from the crash so easily or demonstrate its innovative prowess.
Additionally, significant efforts were made by Shardeum's technology team as they launched into immediate action. The initial step involved a thorough analysis to identify the root cause of the crash, which was traced to an anomaly in the interaction between the network's missing archiver detection and its recovery mode system. Understanding the complexity of the issue, the team quickly implemented a multi-faceted approach to address both the immediate impacts and the underlying vulnerabilities.
A Multi-Pronged Response from the Shardeum Technology Team
Technically, the response was multi-pronged: first, the team isolated the affected components to prevent further network degradation. At the same time, they applied a patch to fix the bug in the missing archiver detection system, ensuring it could handle an empty validator list — the critical error that had triggered the network failure. To restore the network to full operational capacity, the data stored in the archivers was then used to reconstruct the state of the network before the crash, ensuring no data was lost in the process.
Logistically, the team coordinated across time zones and disciplines, leveraging cloud-based tools for collaboration and real-time monitoring. This coordinated effort not only facilitated rapid development and deployment of fixes, but also ensured that all team members were aligned on the recovery process and next steps.
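As a rough, hypothetical sketch of that restore step, reusing the shapes from the earlier archiver example: each rejoining node reloads its assigned partitions from the archived full state. The types and names are assumptions for illustration, not the actual restore procedure.

```typescript
// Hypothetical restore step: reload a node's partitions from archived state.

type NetworkState = Map<number, Record<string, unknown>>;

function restorePartitions(
  archivedState: NetworkState,
  assignedPartitions: number[]
): NetworkState {
  const restored: NetworkState = new Map();
  for (const partition of assignedPartitions) {
    // Copy each assigned partition from the archiver's full copy of the state.
    restored.set(partition, archivedState.get(partition) ?? {});
  }
  return restored;
}

// Example: the archiver holds partitions 0 and 1; a rejoining node reloads both.
const archived: NetworkState = new Map();
archived.set(0, { "0xabc": { balance: 10 } });
archived.set(1, { "0xdef": { balance: 5 } });
console.log(restorePartitions(archived, [0, 1]).size); // 2 partitions restored
```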
This incident served as a rigorous test of Shardeum's crash management protocols and highlighted the importance of agile and innovative responses to unexpected challenges. This underscores the team's commitment to maintaining a resilient and secure network, ready to overcome complex technical obstacles as they arise.
Safe & Space-Bound Landing Innovations
In conclusion, the successful recovery of the Shardeum sharded network marks a significant shift in network technology, a milestone with far-reaching implications for the industry. While currently little known, innovations like the network mode framework will eventually set new industry standards across web3.
It has long been my belief that Shardeum's core innovations are very likely to influence future technological developments, inspiring innovation and a new generation of ledger technology. Having been a first-hand witness to the first-ever Shardeum network recovery, I knew that this would be a catalyst for re-evaluating industry standards, potentially leading to the adoption of more stringent protocols and methodologies in network design and architecture.
This event not only showcased the technical prowess and innovation of the Shardeum team, but also signaled the beginning of an era where decentralized networks become more robust, adaptable, and able to handle unforeseen challenges when it comes to disaster recovery planning. Ultimately, Shardeum technology will herald a new era of decentralization.