The Sui engineering team quickly diagnosed the problem and released a fix, which validators then deployed, keeping network downtime to a minimum.

Event Overview

Between approximately 1:15 AM and 3:45 AM Pacific Time on November 21, 2024 (5:15 PM to 7:45 PM China Standard Time), the Sui mainnet experienced a complete network stall. All validator nodes entered a crash loop, causing a total interruption of transaction processing.

Cause of the Issue

An assert! in the congestion control code caused validator nodes to crash whenever a transaction's estimated execution cost was zero. The crash occurs when both of the following conditions hold:

1. Congestion control is set to TotalGasBudgetWithCap mode:

  • This mode was briefly enabled in protocol version 63 before being reverted, and was then re-enabled together with the cumulative scheduler in protocol version 68.

2. The network receives a transaction that has both of the following:

  • A mutable shared object input

  • Zero MoveCall commands

When the network receives such transactions, all validator nodes crash immediately.
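
To make the two conditions concrete, here is a minimal Rust sketch of the failure mode. It is not Sui's actual code: the Tx struct, the ASSUMED_CAP_PER_MOVE_CALL constant, and the cost formula (gas budget capped in proportion to the number of MoveCall commands) are assumptions chosen only to show how a transaction with zero MoveCall commands can produce a zero estimate that trips an assertion.

```rust
/// Hypothetical, simplified transaction shape (not Sui's real types).
pub struct Tx {
    pub gas_budget: u64,
    pub move_calls: u64,
    pub has_mutable_shared_input: bool,
}

/// Assumed per-MoveCall cap; the value is illustrative only.
pub const ASSUMED_CAP_PER_MOVE_CALL: u64 = 1_000_000;

/// Assumed cost model: the gas budget, capped in proportion to the MoveCall count.
/// With zero MoveCall commands the cap, and therefore the estimate, is zero.
pub fn estimated_cost(tx: &Tx) -> u64 {
    tx.gas_budget
        .min(tx.move_calls.saturating_mul(ASSUMED_CAP_PER_MOVE_CALL))
}

/// Mirrors the failure mode: an assertion that rejects a zero estimate
/// for transactions that write to a shared object.
pub fn record_cost_strict(tx: &Tx) -> u64 {
    let cost = estimated_cost(tx);
    if tx.has_mutable_shared_input {
        assert!(cost > 0, "estimated execution cost must be non-zero");
    }
    cost
}
```

Under these assumptions, a transaction with a mutable shared object input and no MoveCall commands gets an estimate of zero, so record_cost_strict panics. Because every validator evaluates the same transaction, every validator hits the same panic.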

What is congestion control?

The Sui network's object-based architecture allows transactions from different users to be processed with massive parallelism, which is not feasible in most other networks. However, if multiple transactions write to the same shared object, those transactions must be executed sequentially, which limits the throughput of transactions that touch that object.

The congestion control system keeps the network from being overloaded by limiting the rate of transactions that write to the same shared object, which avoids checkpoints with very long execution times.
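
The idea of a per-object limit can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not Sui's implementation: SharedObjectBudget, try_schedule, and schedule_all are hypothetical names, each transaction is simplified to a single shared object and a single cost number, and a transaction that does not fit is only reported as not scheduled.

```rust
use std::collections::HashMap;

/// Hypothetical per-object budget for one scheduling round (not Sui's real types).
pub struct SharedObjectBudget {
    limit_per_object: u64,
    used: HashMap<String, u64>,
}

impl SharedObjectBudget {
    pub fn new(limit_per_object: u64) -> Self {
        Self { limit_per_object, used: HashMap::new() }
    }

    /// Charges `cost` to `object_id` and returns true if the object stays
    /// within its limit; otherwise leaves the state unchanged and returns false.
    pub fn try_schedule(&mut self, object_id: &str, cost: u64) -> bool {
        let used = self.used.get(object_id).copied().unwrap_or(0);
        let next = used.saturating_add(cost);
        if next > self.limit_per_object {
            return false;
        }
        self.used.insert(object_id.to_string(), next);
        true
    }
}

/// Runs a batch of (object, cost) pairs and reports which ones were scheduled.
pub fn schedule_all(limit_per_object: u64, txs: &[(&str, u64)]) -> Vec<bool> {
    let mut budget = SharedObjectBudget::new(limit_per_object);
    txs.iter()
        .map(|(id, cost)| budget.try_schedule(id, *cost))
        .collect()
}
```

With a limit of 10, two transactions of cost 6 on the same object cannot both be scheduled, while a third transaction of cost 6 on a different object is unaffected. The budget applies to each shared object separately.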

We recently upgraded the congestion control system to improve shared object utilization by estimating transaction complexity more accurately. However, a bug in the code for the new TotalGasBudgetWithCap mode led to this incident.

How was the issue resolved?

Once the issue was identified, the code fix was straightforward (see PR #20365). The fix has been deployed to mainnet (v1.37.4) and testnet (v1.38.1).

PR #20365: Modified bump_object_execution_cost to use saturating addition and to accept zero-cost transactions.
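
The shape of such a fix can be sketched as follows. This is a hedged illustration, not the contents of the PR: bump_cost_sketch and the HashMap-based cost table are hypothetical stand-ins, used only to show what saturating addition with zero costs allowed looks like in Rust.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a per-object cost table update (not Sui's real code).
/// Adds `cost` to the running total for `object_id` and returns the new total.
pub fn bump_cost_sketch(costs: &mut HashMap<String, u64>, object_id: &str, cost: u64) -> u64 {
    let total = costs.entry(object_id.to_string()).or_insert(0);
    // saturating_add clamps at u64::MAX instead of overflowing, and a cost of
    // zero simply leaves the total unchanged; no assertion rejects it.
    *total = total.saturating_add(cost);
    *total
}
```

In this sketch a zero-cost transaction is a harmless no-op, and an overly large cost pins the total at u64::MAX rather than overflowing.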

🌟 Mainnet v1.37.4: https://github.com/MystenLabs/sui/releases

Thanks to the validator community's prompt response, the Sui network returned to normal only 15 minutes after the fix was released.

What have we learned?

  • Incident detection and response worked well: Automated alerts and community reports came in almost simultaneously, and we quickly mobilized the team to diagnose and fix the problem.

  • The validator community performed excellently: Once the fix was released, the Sui network returned to normal almost immediately.

Preventive Measures

  1. Improve testing: Generate more adversarial transaction types like the one that triggered this crash, so that similar issues are discovered before they reach mainnet.

  2. Optimize the build process: Speed up the generation of debug and release binaries to further reduce incident response time. Part of the downtime in this incident was spent waiting for the release build.

Thanks to the support of the community and validator nodes, we ensured a rapid recovery of the Sui network!