In a previous post, we discussed Advanced Formal Verification of Zero-Knowledge Proofs: How to Verify a ZK Instruction. By formally verifying each zkWasm instruction, we are able to fully verify the technical security and correctness of the entire zkWasm circuit. In this post, we will focus on the vulnerability discovery perspective, analyzing the specific vulnerabilities found during the audit and verification process, as well as the lessons learned from them. For a general discussion on advanced formal verification of zero-knowledge proof (ZKP) blockchains, see the Advanced Formal Verification of Zero-Knowledge Proof Blockchains post.

Before discussing ZK vulnerabilities, let’s first understand how CertiK performs ZK formal verification. For a complex system like the ZK Virtual Machine (zkVM), the first step of formal verification (FV) is to clarify what needs to be verified and its nature. This requires a comprehensive review of the design, code implementation, and test setup of the ZK system. This process overlaps with regular audits, but the difference is that the goals and nature of the verification need to be established after the review. At CertiK, we call this a verification-oriented audit. Auditing and verification work is usually a whole. For zkWasm, we performed both auditing and formal verification on it.

What is a ZK vulnerability?

The core feature of a zero-knowledge proof system is that it allows a short cryptographic proof of a computation (such as a blockchain transaction) that was performed offline or privately to be passed to a zero-knowledge proof verifier, which checks and confirms that the computation actually occurred as claimed. In this regard, a ZK vulnerability would allow a hacker to submit a forged ZK proof proving a false transaction and have it accepted by a ZK proof checker.

For the zkVM prover, the ZK proof process involves running the program, generating a record of each step, and converting the data from the execution record into a table of numbers (a process called "arithmization"). These numbers must satisfy a set of constraints (i.e., "circuits"), which include connection equations between specific table cells, fixed constants, database search constraints between tables, and polynomial equations (i.e., "gates") that must be satisfied between each pair of adjacent table rows. On-chain verification can thus confirm that there is indeed a table that satisfies all constraints, while ensuring that the specific numbers in the table cannot be seen.

zkWasm arithmetic table

The accuracy of each constraint is critical. An error in any constraint, whether weak or missing, could allow a hacker to submit a misleading proof that appears to represent a valid run of a smart contract when it does not. These vulnerabilities are amplified by the opacity of zkVM transactions compared to traditional VMs. In non-ZK chains, the computational details of transactions are publicly recorded on the blockchain; zkVM does not store these details on-chain. The lack of transparency makes it difficult to determine the specific circumstances of an attack, or even whether an attack has occurred.

The ZK circuits that execute zkVM instruction rules are extremely complex. For zkWasm, the implementation of its ZK circuits involves more than 6,000 lines of Rust code and hundreds of constraints. This complexity often means that there may be multiple vulnerabilities waiting to be discovered.

zkWasm circuit architecture

Indeed, we found multiple such vulnerabilities through our audit and formal verification of zkWasm. Below, we will discuss two representative examples and discuss the differences between them.

Code vulnerability: Load8 data injection attack

The first vulnerability involves zkWasm's Load8 instruction. In zkWasm, reading heap memory is done via a set of LoadN instructions, where N is the size of the data to be loaded. For example, Load64 should read 64 bits of data from a zkWasm memory address. Load8 should read 8 bits of data (i.e. a byte) from memory and pad it with a 0 prefix to create a 64-bit value. zkWasm internally represents memory as an array of 64-bit bytes, so it needs to "pick" a portion of the memory array. Four intermediate variables (u16_cells) are used for this purpose, which together form the full 64-bit load value.

The constraints for these LoadN instructions are defined as follows:

This constraint is divided into three cases: Load32, Load16, and Load8. Load64 has no constraint because the memory unit is exactly 64 bits. For the case of Load32, the code ensures that the upper 4 bytes (32 bits) in the memory unit must be zero.

For the case of Load16, the upper 6 bytes (48 bits) in the memory location must be zero.

In the case of Load8, the upper 7 bytes (56 bits) of the memory location should be zero. Unfortunately, this is not the case in the code.

As you can see, only the upper 9 to 16 bits are constrained to be zero. The other upper 48 bits can be any value and still pretend to be "read from memory".

By exploiting this vulnerability, a hacker can tamper with the ZK proof of a legitimate execution sequence, causing the execution of the Load8 instruction to load these unexpected bytes, resulting in data corruption. And, by carefully arranging the surrounding code and data, it is possible to trigger false operations and transfers, thereby stealing data and assets. Such forged transactions can pass the zkWasm checker and be mistakenly identified as real transactions by the blockchain.

Fixing this vulnerability is actually quite simple.

The vulnerability is representative of a class of ZK vulnerabilities known as “code vulnerabilities” because they stem from how the code is written and can be easily fixed with small, local code changes. As you might agree, these vulnerabilities are also relatively easy to spot.

Design Vulnerability: Spoofed Return Attack

Let's look at another vulnerability, this time regarding zkWasm's call and return. Call and return are fundamental VM instructions that allow one execution context (e.g. a function) to call another, and resume execution of the caller context after the callee has completed its execution. Each call is expected to return once later. The dynamic data that zkWasm tracks during the lifecycle of a call and return is called a "call frame". Since zkWasm executes instructions sequentially, all call frames can be sorted based on when they occurred during execution. Below is an example of call/return code running on zkWasm.

Users can call the buy_token() function to buy tokens (perhaps by paying or transferring something else of value). One of its core steps is to actually increase the token account balance by 1 by calling the add_token() function. Since the ZK prover itself does not support the call frame data structure, the execution table (E-Table) and jump table (J-Table) are needed to record and track the complete history of these call frames.

The figure above illustrates the execution process of buy_token() calling add_token(), and the process of returning from add_token() to buy_token(). As you can see, the token account balance has increased by 1. In the execution table, each execution step occupies a row, which includes the call frame number currently being executed, the current context function name (only for the description here), the number of the currently running instruction in the function, and the current instruction stored in the table (only for the description here). In the jump table, each call frame occupies a row, and the table stores the number of its caller frame, the caller function context name (only for the description here), and the next instruction position of the caller frame (so that the frame can return). In both tables, there is a jops column that tracks whether the current instruction is a call/return (in the execution table) and the total number of call/return instructions that occurred in the frame (in the jump table).

As one would expect, each call should have a corresponding return, and each frame should have only one call and one return. As shown in the figure above, the jops value of frame 1 in the jump table is 2, which corresponds to rows 1 and 3 in the execution table, where the jops value is 1. So far, everything looks good.

But there's actually a problem here: while one call and one return will give a frame jops count of 2, two calls or two returns will also give a count of 2. Having two calls or two returns per frame may sound ridiculous, but keep in mind that this is exactly what the hacker is trying to do by breaking expectations.

You might be a little excited now, but have we really found the problem?

It turns out that the two calls are not a problem because the constraints of the execution table and the call table prevent the two calls from being encoded into the same frame row, because each call will generate a new frame number, that is, the current call frame number plus 1.

The case of two returns is not so lucky: since no new frame is created when returning, it is indeed possible for a hacker to obtain the execution table and call table of a legitimate running sequence and inject a forged return (and the corresponding frame). For example, the example of buy_token() calling add_token() in the previous execution table and call table can be tampered by a hacker to the following:

The hacker injected two fake returns between the original call and return in the execution table, and added a new fake frame row in the call table (the run step number of the original return and subsequent instructions needs to be increased by 4 in the execution table). Since the jops count of each row in the call table is 2, the constraint is satisfied and the zkWasm proof checker will accept the "proof" of this fake execution sequence. As can be seen from the figure, the token account balance increased 3 times instead of 1. Therefore, the hacker was able to get 3 tokens for the price of paying 1 token.

There are several ways to solve this problem. An obvious way is to track the call and return separately and make sure there is exactly one call and one return per frame.

You may have noticed that we have not shown a single line of code for this vulnerability so far. The main reason is that there is no single line of code that is problematic and the implementation is completely compliant with the table and constraint design. The problem is in the design itself, and so is the fix.

You might think that this loophole should be obvious, but it is not. This is because there is a gap between "two calls or two returns will also result in a jops count of 2" and "actually two returns are possible". The latter requires a detailed and complete analysis of the various constraints related to the execution table and the call table, which is difficult to perform complete informal reasoning.

Comparison of the two vulnerabilities

For the "Load8 data injection vulnerability" and the "forged return vulnerability", both can allow hackers to manipulate ZK proofs, create false transactions, deceive proof checkers, and steal or hijack. But their nature and the way they are discovered are completely different.

The "Load8 data injection vulnerability" was discovered during an audit of zkWasm. This was no easy task, as we had to review over 6,000 lines of Rust code and hundreds of constraints for hundreds of zkWasm instructions. Despite this, the vulnerability was relatively easy to find and confirm. Since the vulnerability was fixed before formal verification began, it was not encountered during verification. If the vulnerability was not found during the audit, we can expect to find it during verification of the Load8 instruction.

The “spurious return vulnerability” was discovered during formal verification after the audit. Part of the reason we failed to discover it during the audit is that zkWasm has a mechanism called “mops” that is very similar to jops, which tracks the memory access instructions corresponding to the history of each memory cell during the zkWasm run. The count constraint for mops is indeed correct because it only tracks one type of memory instruction, namely memory writes; and the history of each memory cell is immutable and is only written once (mops count is 1). But even if we noticed this potential vulnerability during the audit, without rigorous formal reasoning about all relevant constraints, we would still not be able to easily confirm or rule out every possible case because no line of code is actually wrong.

In summary, these two types of vulnerabilities are "code vulnerabilities" and "design vulnerabilities" respectively. Code vulnerabilities are relatively superficial, easier to find (wrong code), and easier to reason and confirm; design vulnerabilities may be very hidden, more difficult to find (no "wrong" code), and more difficult to reason and confirm.

Best Practices for Discovering ZK Vulnerabilities

Based on our experience auditing and formally verifying zkVM and other ZK and non-ZK chains, here are some suggestions on how to best secure a ZK system:

Check code and design

As mentioned before, there are vulnerabilities in both ZK's code and design. Both types of vulnerabilities can lead to the compromise of a ZK system, so they must be eliminated before the system goes live. One problem with ZK systems is that any attack is more difficult to uncover and analyze than with non-ZK systems, because the details of the computation are not public or retained on-chain. So people may know that a hack has occurred, but not how it happened at a technical level. This makes the cost of any ZK vulnerability very high. Accordingly, the value of ensuring the security of a ZK system in advance is also very high.

Conduct audits and formal verification

The two vulnerabilities we present here were discovered through auditing and formal verification, respectively. One might think that if formal verification is used, auditing is unnecessary because all vulnerabilities will be discovered by formal verification. In fact, our recommendation is to perform both. As explained at the beginning of this article, a high-quality formal verification effort begins with a thorough review, inspection, and informal reasoning of the code and design; this work itself overlaps with auditing. In addition, finding and eliminating simpler vulnerabilities during audits will make formal verification easier and more efficient.

If a ZK system is to be both audited and formally verified, the best time is to do both at the same time so that auditors and formal verification engineers can collaborate efficiently (potentially finding more vulnerabilities because the objects and goals of formal verification require high-quality audit input).

If your ZK project has already been audited (thumbs up) or audited multiple times (thumbs up), our recommendation is to perform formal verification on the basis of the previous audit results. Our experience with audits and formal verification of zkVM and other ZK and non-ZK projects has repeatedly shown that verification often catches vulnerabilities that are missed in audits and are not easily discovered. Due to the characteristics of ZKP, while ZK systems should provide better blockchain security and scalability than non-ZK solutions, their own security and correctness are much more critical than traditional non-ZK systems. Therefore, the value of high-quality formal verification of ZK systems is also much higher than that of non-ZK systems.

Ensure the security of circuits and smart contracts

ZK applications usually consist of two parts: circuits and smart contracts. For zkVM-based applications, there are general zkVM circuits and smart contract applications. For non-zkVM-based applications, there are application-specific ZK circuits and corresponding smart contracts deployed on the other side of the L1 chain or bridge.

ZK Applications Circuit Smart Contracts zkVM-based zkVM Applications Non-zkVM-based Application-specific L1 Chain/Bridge

Our audit and formal verification work on zkWasm only covers zkWasm circuits. From the perspective of the overall security of a ZK application, it is also very important to audit and formally verify its smart contracts. After all, after investing so much effort in ensuring the security of the circuits, it would be a pity if we let down our guard on the smart contracts and the application is eventually compromised.

There are two types of smart contracts that deserve special attention. The first are smart contracts that deal directly with ZK proofs. Although they may not be very large in size, they are very high risk. The second are medium to large smart contracts running on top of zkVM. We know that they can sometimes be very complex, and the most valuable ones should be audited and verified, especially since people cannot see their execution details on-chain. Fortunately, after years of development, formal verification of smart contracts is now practical and ready for suitable high-value targets.

Let’s summarize the impact of Formal Verification (FV) on ZK systems and their components by explaining the following.