Event Overview
Between approximately 1:15 AM and 3:45 AM Pacific Time on November 21, 2024 (5:15 PM to 7:45 PM GMT+8 on November 21, 2024), the Sui mainnet experienced a complete network stall. All validator nodes fell into a crash loop, resulting in a complete halt of transaction processing.
Cause of the issue
Validator nodes crashed because of an assert! in the congestion control code that fails when a transaction's estimated execution cost is zero. The crash occurs when both of the following conditions hold:
1. Congestion control is set to TotalGasBudgetWithCap mode:
This mode was briefly enabled in protocol version 63 and then reverted, and was later re-enabled together with the cumulative scheduler in protocol version 68.
2. The network receives a transaction that has both of the following properties:
A mutable shared object input
Zero MoveCall commands
When the network receives such transactions, all validator nodes crash immediately.
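The trigger condition can be illustrated with a small Rust model. This is a hedged sketch, not the validator's actual code: the Tx struct, the COST_PER_MOVE_CALL_CAP constant, and both function names are hypothetical, and the assumption that the cap scales with the number of MoveCalls is made only to show how a transaction with zero MoveCalls can end up with an estimated cost of zero.

```rust
// Hypothetical, simplified model of a transaction. Not the real Sui types.
pub struct Tx {
    pub mutable_shared_inputs: usize,
    pub move_calls: u64,
    pub gas_budget: u64,
}

// Assumed per-MoveCall cap, chosen only for illustration.
pub const COST_PER_MOVE_CALL_CAP: u64 = 1_000;

// Assumed cost model: the gas budget, capped by a term that is
// proportional to the number of MoveCalls in the transaction.
pub fn estimated_cost(tx: &Tx) -> u64 {
    let cap = tx.move_calls.saturating_mul(COST_PER_MOVE_CALL_CAP);
    tx.gas_budget.min(cap)
}

// True when the transaction matches both conditions listed above:
// it touches a mutable shared object and its estimated cost is zero.
pub fn is_crash_trigger(tx: &Tx) -> bool {
    tx.mutable_shared_inputs > 0 && estimated_cost(tx) == 0
}
```

Under this model, a transaction with one mutable shared input, a non-zero gas budget, and no MoveCalls gets a cap of zero, so its estimated cost is zero regardless of the budget.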
What is congestion control?
The Sui network's object-based architecture supports massively parallel processing of different users' transactions, which is not achievable in most other networks. However, if multiple transactions write to the same shared object, those transactions must be executed in sequence, which limits the throughput of transactions involving that object.
The congestion control system limits the rate of transactions that write to the same shared object, so that long-running checkpoints do not overload the network.
We recently upgraded the congestion control system to improve the utilization of shared objects by estimating transaction complexity more accurately. However, a bug in the code for the new TotalGasBudgetWithCap mode led to this incident.
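The per-object limiting described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Sui's scheduler: CongestionTracker, per_object_limit, and try_schedule are hypothetical names, object IDs are plain strings, and the policy (defer a transaction when any of its shared objects would exceed a fixed cost budget) is a simplification.

```rust
use std::collections::HashMap;

// Hypothetical tracker of accumulated execution cost per shared object.
pub struct CongestionTracker {
    per_object_limit: u64,
    object_cost: HashMap<String, u64>,
}

impl CongestionTracker {
    pub fn new(per_object_limit: u64) -> Self {
        Self { per_object_limit, object_cost: HashMap::new() }
    }

    // Returns true if the transaction is scheduled, false if it is deferred.
    pub fn try_schedule(&mut self, shared_objects: &[&str], cost: u64) -> bool {
        // Check every object first so that a deferred transaction
        // leaves no partial charge behind.
        let fits = shared_objects.iter().all(|id| {
            let used = self.object_cost.get(*id).copied().unwrap_or(0);
            used.saturating_add(cost) <= self.per_object_limit
        });
        if !fits {
            return false;
        }
        // Charge the cost to each shared object the transaction writes to.
        for id in shared_objects {
            let used = self.object_cost.entry((*id).to_string()).or_insert(0);
            *used = used.saturating_add(cost);
        }
        true
    }
}
```

In this sketch, transactions on different shared objects never compete for budget, while transactions on the same object are deferred once that object's budget is used up.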
How was the issue resolved?
Once the issue was identified, the code fix was straightforward (see PR #20365). The fix has been deployed to mainnet (v1.37.4) and testnet (v1.38.1).
PR #20365: Changed bump_object_execution_cost to use saturating addition and to allow zero-cost transactions.
🌟 Mainnet v1.37.4:
https://github.com/MystenLabs/sui/releases
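The kind of change described for PR #20365 can be sketched as follows. This is an illustration only: both function names and signatures are hypothetical and do not reproduce the real bump_object_execution_cost, and the use of plain addition in the "before" version is an assumption.

```rust
// Sketch of the behaviour before the fix: a zero estimated cost
// trips the assertion and aborts the process.
pub fn bump_cost_with_assert(current: u64, estimated_cost: u64) -> u64 {
    assert!(estimated_cost > 0, "estimated execution cost must be positive");
    current + estimated_cost
}

// Sketch of the behaviour after the fix: zero-cost transactions are
// accepted, and the sum saturates at u64::MAX instead of overflowing.
pub fn bump_cost_saturating(current: u64, estimated_cost: u64) -> u64 {
    current.saturating_add(estimated_cost)
}
```

With the second version, a zero-cost transaction simply leaves the accumulated cost unchanged, and an extremely large cost clamps at the maximum value rather than panicking.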
Thanks to the active engagement of the validator community, the Sui network returned to normal operation only 15 minutes after the fix was released.
What did we learn?
Incident detection and response worked well: automated alerts and community reports arrived almost simultaneously, and we quickly mobilized the team to diagnose and fix the problem.
The validator node community performed excellently: the Sui network returned to normal almost immediately after the fix was released.
Preventive measures
Improve the testing system: add more adversarial transaction types, similar to the one that triggered this crash, to uncover potential issues.
Optimize the build process: speed up the generation of debug and release binaries to further reduce incident response time. Part of the downtime during this outage was spent waiting for the release build.
Thank you to the community and validator nodes for their support, which ensured a rapid recovery of the Sui network!