Original title: "Should Ethereum be okay with enshrining more things in the protocol?"
Original article by: Vitalik Buterin
Original translation: Nian Yin Sitang, Odailynews
Special thanks to Justin Drake, Tina Zhen, and Yoav Weiss for their feedback and reviews.
From the beginning of the Ethereum project, there has been a strong philosophy of trying to keep core Ethereum as simple as possible, and achieving as much as possible by building protocols on top of it. In the blockchain space, the "build on L1" vs. "focus on L2" debate is usually framed as being primarily about scaling, but in reality the same question arises for meeting many other needs of Ethereum users: digital asset exchange, privacy, usernames, advanced cryptography, account security, censorship resistance, front-running protection, and so on. Recently, however, there has been some cautious interest in enshrining more of these features into the core Ethereum protocol.
This post dives into some of the philosophical reasoning behind the original minimal-enshrinement philosophy, as well as some more recent ways of thinking about these ideas. The goal is to begin building a framework for better identifying the possible targets where enshrining certain functionality might be worth considering.
Early Philosophy on Protocol Minimalism
In the early history of what was then called "Ethereum 2.0," there was a strong desire to create a clean, simple, and beautiful protocol that built as little as possible into itself, leaving almost all such work to users. Ideally, the protocol would be just a virtual machine, and validating a block would be just a single virtual machine call.
The "state transition function" (the function that processes a block) will just be a single VM call, and all other logic will happen through contracts: some system-level contracts, but mostly user-provided contracts. A very nice feature of this model is that even an entire hard fork can be described as a single transaction to the block processor contract, which will be approved by off-chain or on-chain governance and then run with upgraded permissions.
These 2015-era discussions apply particularly to two areas we will consider: account abstraction and scaling. In the case of scaling, the idea was to create a maximally abstract form of scaling that felt like a natural extension of the diagram above. Contracts would be able to call data that most Ethereum nodes do not store; the protocol would detect this and resolve the call through some very general extended-computation function. From the virtual machine's perspective, the call would go out to some separate subsystem and then magically return the correct answer some time later.
We briefly explored this line of thought but quickly abandoned it, because we were too preoccupied with proving that any kind of blockchain scaling was possible at all. Although, as we will see later, the combination of data availability sampling and ZK-EVMs means that one possible future for Ethereum scaling actually looks remarkably close to this vision! With account abstraction, on the other hand, we knew from the outset that some kind of implementation was possible, so research immediately began on making something as close as possible to the pure starting point of "a transaction is just a call."
There is a lot of boilerplate code between processing a transaction and making the actual underlying EVM call from the sender address, and more after that. How can we reduce this to as close to zero as possible?
One of the key pieces of code here is validate_transaction(state, tx), which is responsible for checking that the transaction's nonce and signature are correct. From the beginning, the actual goal of account abstraction has been to allow users to replace the default nonce-incrementing and ECDSA validation with their own validation logic, so that users can more easily use features such as social recovery and multisig wallets. Therefore, finding a way to re-architect apply_transaction as a simple EVM call was never just a "clean code for the sake of clean code" exercise; it was about moving the logic into the user's account code, giving users the flexibility they need.
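As a rough illustration, here is a simplified sketch (with invented data structures and a stubbed signature check, not actual client code) of the fixed validation logic that account abstraction aims to move into user-defined account code:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    sender: str
    nonce: int
    signature: bytes

def ecdsa_recover(tx: Transaction) -> str:
    """Stub: a real client recovers the signer from the secp256k1 signature."""
    raise NotImplementedError

def validate_transaction(state: dict, tx: Transaction) -> None:
    account = state[tx.sender]
    # Hard-coded rule 1: the nonce must match exactly, then increment.
    if tx.nonce != account["nonce"]:
        raise ValueError("invalid nonce")
    # Hard-coded rule 2: a valid secp256k1 ECDSA signature from the sender.
    # Account abstraction replaces both rules with whatever validation logic
    # the account's own code defines (multisig, social recovery, ...).
    if ecdsa_recover(tx) != tx.sender:
        raise ValueError("invalid signature")
    account["nonce"] += 1
```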
However, insisting that apply_transaction contain as little fixed logic as possible ultimately leads to a lot of challenges. We can look at one of the earliest account abstraction proposals, EIP-86.
If EIP-86 had been included as-is, it would have reduced the complexity of the EVM at the cost of massively increasing complexity in other parts of the Ethereum stack, requiring essentially the same code to be written elsewhere, while also introducing entirely new classes of weirdness, such as the possibility of the same transaction with the same hash appearing multiple times in the chain, not to mention the multiple-invalidation problem.
The multiple-invalidation problem in account abstraction: one transaction included on-chain can invalidate thousands of other transactions in the mempool, making it cheap to flood the mempool.
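A toy demonstration of the problem (the validation rule and numbers here are invented): with arbitrary validation logic, thousands of pending transactions can all be valid against the current state, and a single inclusion can invalidate all the rest at almost no cost to the attacker:

```python
def account_validation(state, tx):
    # Hypothetical user-defined rule: transactions are valid only while
    # some piece of account state is unset.
    return not state["flag"]

state = {"flag": False}
mempool = [{"id": i} for i in range(5000)]

# All 5,000 pending transactions pass validation right now...
assert all(account_validation(state, tx) for tx in mempool)

# ...but including any ONE of them flips the flag as a side effect,
state["flag"] = True

# ...and the other 4,999 become invalid: the attacker paid for a single
# transaction while wasting the network's work on thousands.
print(sum(account_validation(state, tx) for tx in mempool))  # 0
```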
Since then, account abstraction has evolved in stages. EIP-86 became EIP-208, and eventually the actually workable EIP-2938 emerged.
However, EIP-2938 is far from minimalistic. It includes:
A new transaction type
Three new transaction-wide global variables
Two new opcodes, including the very awkward PAYGAS opcode, which handles gas price and gas limit checks, acts as an EVM execution breakpoint, and temporarily holds ETH for one-time fee payment
A complex set of mining and rebroadcasting strategies, including a list of opcodes banned during the transaction validation phase
In order to achieve account abstraction without waiting on Ethereum core developers (who were busy optimizing Ethereum clients and implementing the merge), EIP-2938 was ultimately re-architected as ERC-4337, which lives entirely outside the protocol.
Because it is an ERC, it does not require a hard fork and technically exists "outside the Ethereum protocol." So… problem solved? It turns out not to be. The current medium-term roadmap for ERC-4337 actually involves eventually turning large parts of it into a series of protocol features, and it is a useful guiding example of why this path might be considered.
Wrapping ERC-4337
Several key reasons were discussed for eventually reintegrating ERC-4337 into the protocol:
Gas efficiency: Any operation performed inside the EVM incurs some level of VM overhead, including inefficiencies when using gas-expensive features like storage slots. Currently, these additional inefficiencies add up to at least 20,000 gas, and often more. Putting these components into the protocol is the easiest way to eliminate these issues.
Code bug risk: if the ERC-4337 "entry point contract" had a sufficiently terrible bug, all ERC-4337-compatible wallets could see their funds drained. Replacing the contract with in-protocol functionality creates an implicit obligation to fix code bugs via a hard fork, which eliminates the funds-draining risk for users.
Support for opcodes like tx.origin: ERC-4337 currently makes tx.origin return the address of the "bundler" that packaged a set of user operations into a transaction. Native account abstraction can fix this by making tx.origin point to the actual account sending the transaction, so that it works the same way as it does for EOAs.
Censorship resistance: one of the challenges of proposer/builder separation is that it becomes easier to censor individual transactions. In a world where individual transactions are legible to the Ethereum protocol, this problem can be greatly alleviated by inclusion lists, which allow proposers to specify a list of transactions that must be included within the next two slots in almost all cases (a minimal version of the check is sketched below). But the out-of-protocol ERC-4337 wraps "user operations" inside a single transaction, making them opaque to the Ethereum protocol; the inclusion lists provided by the Ethereum protocol therefore cannot provide censorship resistance to ERC-4337 user operations. Encapsulating ERC-4337 and making user operations a "proper" transaction type would solve this problem.
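Here is a minimal sketch of the inclusion-list check (hypothetical structures; real designs handle more edge cases, e.g. full blocks), which also shows why it only helps objects the protocol can see as transactions:

```python
def satisfies_inclusion_list(block_txs, inclusion_list):
    # A block is valid only if every listed transaction appears in it.
    included = set(block_txs)
    return all(tx in included for tx in inclusion_list)

# The protocol can force inclusion of whole transactions it recognizes:
print(satisfies_inclusion_list({"tx_a", "tx_b"}, ["tx_a"]))   # True

# But an ERC-4337 user operation is just opaque bytes inside a bundler's
# transaction; the protocol has no handle with which to list it, so it
# gains no censorship resistance from this mechanism.
bundle = {"bundler_tx_carrying_many_user_ops"}
print(satisfies_inclusion_list(bundle, ["user_op_42"]))       # False
```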
It’s worth mentioning that in its current form, ERC-4337 is significantly more expensive than "basic" Ethereum transactions: a basic transaction costs 21,000 gas, while an ERC-4337 operation costs around 42,000 gas.
In theory, it should be possible to tweak the EVM gas cost system until the in-protocol cost and the cost of accessing storage outside of the protocol match; there's no reason why transferring ETH should cost 9000 gas when other kinds of storage editing operations are cheaper. In fact, two of the EIPs related to the upcoming Verkle tree conversion actually attempt to do this. But even if we did that, there's an obvious reason why wrapped protocol functionality will inevitably be much cheaper than EVM code, no matter how efficient the EVM becomes: wrapped code doesn't need to pay gas to be preloaded.
A fully functional ERC-4337 wallet is large: this implementation, compiled and put on-chain, takes up about 12,800 bytes. Sure, you could deploy that code once and use DELEGATECALL to let every individual wallet call into it, but the code would still need to be accessed in every block that uses it. Under the Verkle tree gas cost EIP, 12,800 bytes would be made up of 413 chunks; accessing those chunks would cost 2 witness branch costs (3,800 gas total) and 413 witness chunk costs (82,600 gas total). And that does not even begin to mention the ERC-4337 entry point itself, which, as of version 0.6.0, is 23,689 bytes on-chain (about 158,700 gas to load under the Verkle tree EIP rules).
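We can reproduce this arithmetic directly. The sketch below assumes the constants behind the figures quoted above, taken here from the draft Verkle gas-cost rules: code split into 31-byte chunks, 256 chunks per witness branch, 1,900 gas per branch, and 200 gas per chunk:

```python
from math import ceil

WITNESS_BRANCH_COST = 1900   # per 256-chunk subtree touched
WITNESS_CHUNK_COST = 200     # per 31-byte code chunk accessed

def code_load_gas(code_size_bytes: int) -> int:
    chunks = ceil(code_size_bytes / 31)
    branches = ceil(chunks / 256)
    return branches * WITNESS_BRANCH_COST + chunks * WITNESS_CHUNK_COST

# The ~12,800-byte wallet: 413 chunks across 2 branches.
print(code_load_gas(12_800))   # 86400 gas (3,800 + 82,600)

# The 23,689-byte v0.6.0 entry point: 765 chunks across 3 branches.
print(code_load_gas(23_689))   # 158700 gas
```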
This leads to a problem: the gas cost of actually accessing this code has to be amortized across transactions somehow. The current approach used by ERC-4337 is not very good: the first transaction in a bundle consumes a one-time storage/code fetch cost, making it much more expensive than other transactions. In-protocol encapsulation would allow these public shared libraries to become part of the protocol, freely accessible to everyone.
What can we learn from this example, and when is encapsulation a good idea more generally?
In this example, we saw some different rationales for encapsulating aspects of the account abstraction in the protocol.
Market-based approaches that "push complexity to the edge" are most likely to fail when fixed costs are high. And indeed, the long-term account abstraction roadmap involves a lot of fixed costs per block: 244,100 gas to load standardized wallet code is one thing, but aggregation could add hundreds of thousands of gas per block for ZK-SNARK verification, plus on-chain proof-verification costs. There is no way to charge users for these costs without introducing massive market inefficiencies, whereas turning some of these features into protocol features freely accessible to everyone could solve that problem cleanly.
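A back-of-the-envelope sketch of that inefficiency (the 500,000 gas SNARK-verification figure below is an invented stand-in for "hundreds of thousands of gas"): the fair per-user share of a per-block fixed cost depends on how many other users show up, which nobody knows when pricing a transaction:

```python
# Hypothetical fixed cost per block: standardized wallet code load plus
# a stand-in figure for ZK-SNARK proof verification.
FIXED_GAS_PER_BLOCK = 244_100 + 500_000

for users in (1, 10, 100):
    print(users, FIXED_GAS_PER_BLOCK // users)
# 1   -> 744100 gas charged to a lone user
# 10  ->  74410 gas per user
# 100 ->   7441 gas per user
```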
Community-wide response to code bugs. If some piece of code is used by all users, or by a very wide range of users, it often makes more sense for the blockchain community to take responsibility for hard-forking to fix any bugs that arise. ERC-4337 introduced a large amount of globally shared code, and in the long run it is surely more reasonable for bugs in that code to be fixed by a hard fork than for them to cause users to lose a large amount of ETH.
Sometimes a stronger form of a feature can be achieved by leveraging protocol powers directly. The key example here is in-protocol censorship-resistance features like inclusion lists: in-protocol inclusion lists can provide better censorship resistance than out-of-protocol methods, but for user-level operations to truly benefit from them, individual user-level operations need to be "legible" to the protocol. Another lesser-known example is the 2017-era Ethereum proof-of-stake design, which had account abstraction for staking keys; it was abandoned in favor of wrapping BLS, because BLS supports an "aggregation" mechanism, which must be implemented at the protocol and network level, that makes processing large numbers of signatures much more efficient.
But it’s important to remember that even in-protocol account abstraction is still a massive "de-encapsulation" compared to the status quo. Today, Ethereum transactions can only be initiated from externally owned accounts (EOAs), which are verified using a single secp256k1 elliptic curve signature. Account abstraction removes this requirement and leaves verification conditions for users to define. So in this story about account abstraction, we also see the biggest argument against encapsulation: flexibility to meet the needs of different users.
Let’s flesh out this story further by looking at a few other examples of features that have recently been considered for encapsulation. We’ll focus specifically on: ZK-EVM, proposer-builder separation, private mempools, liquidity staking, and new precompiles.
Wrapping the ZK-EVM
Let’s turn our attention to another potential wrapping target for the Ethereum protocol: the ZK-EVM. Currently, we have a large number of ZK-rollups that all have to write fairly similar code to verify the execution of Ethereum-like blocks in ZK-SNARKs. There is a fairly diverse ecosystem of independent implementations: PSE ZK-EVM, Kakarot, Polygon ZK-EVM, Linea, Zeth, and many more.
A recent controversy in the EVM ZK-rollup space has to do with how to handle possible bugs in the ZK code. Currently, all of these systems in operation have some form of "security council" mechanism that can control the proof system in the event of a bug. Last year, I tried to create a standardized framework to encourage projects to clarify how much trust they have in the proof system, and how much trust they have in the security council, and gradually reduce the power of this organization over time.
In the medium term, rollups may rely on multiple proof systems, with the security council having power only in the extreme case where two different proof systems disagree.
However, there is a sense in which some of this work feels redundant. We already have the Ethereum base layer, which has an EVM, and we already have a working mechanism for dealing with bugs in its implementations: if there is a bug, clients are updated to fix it, and the chain carries on. Blocks that appeared finalized from a buggy client's perspective would end up no longer finalized, but at least users would not lose funds. Similarly, if a rollup just wants to remain equivalent to the EVM, it needs to implement its own governance to keep changing its internal ZK-EVM rules to match upgrades to the Ethereum base layer, which feels wrong, because ultimately it is building on top of the Ethereum base layer itself, which knows when to upgrade and to what new rules.
Since these L2 ZK-EVMs essentially use the exact same EVM as Ethereum, can we somehow incorporate “verifying EVM execution in ZK” into the protocol functionality and handle anomalies like bugs and upgrades by applying Ethereum’s social consensus, just like we already do for the base layer EVM execution itself?
This is an important and challenging topic.
One possible point of contention with a native ZK-EVM is statefulness. ZK-EVMs can be much more data-efficient if they do not need to carry "witness" data around: if a particular piece of data was already read or written in a previous block, we can simply assume the prover has access to it and does not have to make it available again. This goes beyond not reloading storage and code; it turns out that if a rollup compresses data correctly, stateful compression can save up to 3x the data compared to stateless compression.
This means that for ZK-EVM precompiles we have two options:
1. The precompile requires all data to be available in the same block. This means the prover can be stateless, but it also means a ZK-rollup using the precompile is much more expensive than a rollup using custom code.
2. The precompile allows pointers to data used or generated by previous executions. This brings ZK-rollups close to optimal, but it is more complex and introduces a new kind of state that the prover must store. (Both options are sketched below.)
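As interface sketches (hypothetical signatures, not any real EIP), the two options might look like this:

```python
# Option 1: stateless prover. Everything the proof touches must be made
# available in the same block, so verification needs only roots, block
# data, a witness, and the proof.
def verify_evm_stateless(pre_state_root: bytes, post_state_root: bytes,
                         block_data: bytes, witness: bytes,
                         proof: bytes) -> bool:
    ...

# Option 2: stateful prover. The call may point to data used or produced
# by previous executions, so the protocol must maintain an extra piece of
# state (here, a commitment to previously-seen data) on the precompile's
# behalf.
def verify_evm_stateful(pre_state_root: bytes, post_state_root: bytes,
                        block_data: bytes, prior_data_commitment: bytes,
                        proof: bytes) -> bool:
    ...
```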
What can we learn from this? There is a good case for wrapping ZK-EVM verification in some way: rollups are already building their own custom versions of it, and it feels wrong that Ethereum is willing to put the weight of its multiple implementations and off-chain social consensus behind EVM execution on L1, while L2s doing the exact same work have to implement complicated gadgets involving security councils. But on the other hand, there is a big catch in the details: there are different versions of the ZK-EVM, with different costs and benefits. The stateful vs. stateless distinction only scratches the surface; trying to support "almost-EVMs" whose custom code has been proven with other systems would open up a much larger design space. Wrapping the ZK-EVM, then, brings both promise and challenges.
Wrapping proposer/builder separation (ePBS)
The rise of MEV has made block production a large-scale economic activity, with sophisticated actors able to produce blocks that generate more revenue than the default algorithm of simply watching the transaction mempool and including transactions in order. So far, the Ethereum community has attempted to address this problem with out-of-protocol proposer-builder separation schemes such as MEV-Boost, which allow regular validators ("proposers") to outsource block construction to specialized actors ("builders").
However, MEV-Boost makes trust assumptions about a new class of participants called relays. Over the past two years, there have been many proposals to create "wrapped PBS". What is the benefit? In this case, the answer is very simple: PBS built directly with protocol features is more powerful (in the sense of having weaker trust assumptions) than PBS built without them. This is similar to the case for wrapping in-protocol price oracles, although in that case there are also strong objections.
Wrapping private mempools
When a user sends a transaction, it is immediately public and visible to everyone, even before it is included on-chain. This makes users of many applications vulnerable to economic attacks, such as front-running.
Recently, a number of projects have been dedicated to creating "private mempools" (or "encrypted mempools"), which keep users' transactions encrypted until the moment they are irreversibly accepted into a block.
The problem, however, is that such a scheme requires a special kind of encryption: to prevent users from flooding the system with transactions that never decrypt, the encryption must automatically decrypt once a transaction has actually been irreversibly accepted.
There are various techniques with different tradeoffs for implementing this form of encryption. Jon Charbonneau described it well:
Encryption to a centralized operator, such as Flashbots Protect.
Time-lock encryption, a form of encryption that anyone can decrypt after a certain number of sequential computation steps, which cannot be parallelized (see the toy sketch after this list).
Threshold encryption, which trusts an honest-majority committee to decrypt the data; see the shutterized beacon chain concept for a concrete proposal.
Trusted hardware, such as SGX.
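To illustrate the time-lock option concretely, here is a toy Rivest–Shamir–Wagner time-lock puzzle in Python. The parameters are deliberately tiny, and the XOR "encryption" is a stand-in for a real key-derivation step; the point is the asymmetry: the creator shortcuts the work using the factorization, while everyone else must perform the squarings strictly one after another:

```python
p, q = 1_000_003, 1_000_033    # toy primes known only to the puzzle creator
n = p * q
t = 100_000                    # required number of sequential squarings

# Creator's shortcut: reduce the exponent 2**t modulo phi(n).
key_fast = pow(2, pow(2, t, (p - 1) * (q - 1)), n)

# Everyone else: t squarings that cannot be parallelized, since each
# step depends on the previous one.
key_slow = 2
for _ in range(t):
    key_slow = key_slow * key_slow % n

assert key_fast == key_slow
message = 123_456_789
ciphertext = message ^ key_fast   # toy encryption: XOR with the key
print(ciphertext ^ key_slow)      # 123456789, recovered after the delay
```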
Unfortunately, each of these encryption methods has different weaknesses. While for every solution there is a subset of users willing to trust it, no solution is trusted enough to actually be accepted at layer 1. Therefore, at least until delay encryption is perfected or some other technological breakthrough arrives, enshrining anti-frontrunning functionality at L1 seems a difficult proposition, even though it is a valuable enough feature that many application-level solutions have already emerged.
Wrapping liquidity staking
A common request from Ethereum DeFi users is the ability to use their ETH both for staking and as collateral in other applications. Another common request is simply convenience: users want to stake without the complexity of running a node and keeping it online all the time (and protecting their staking key online).
Until now, the simplest staking "interface" meeting both needs has just been an ERC-20 token: convert your ETH into "staked ETH," hold it, and later convert it back. Indeed, liquidity staking providers such as Lido and Rocket Pool already do exactly this. However, liquidity staking has some natural centralizing mechanics at work: people naturally flock to the largest version of staked ETH because it is the most familiar and the most liquid.
Each version of staked ETH needs some mechanism for determining who can be the underlying node operators. It cannot be unrestricted, because then attackers would join and amplify their attacks with users' funds. Currently, the top two are Lido, whose DAO whitelists node operators, and Rocket Pool, which allows anyone to run a node by depositing 8 ETH. The two approaches carry different risks: the Rocket Pool approach allows attackers to 51% attack the network and force users to pay most of the cost; as for the DAO approach, if a single staked token becomes dominant, it leads to a single, potentially attackable governance gadget controlling a very large portion of all Ethereum validators. To be sure, protocols like Lido have implemented safeguards, but one layer of defense may not be enough.
In the short term, one option is to encourage ecosystem participants to use a diverse range of liquidity staking providers, reducing the chance of any single provider becoming dominant and posing systemic risk. In the long term, however, this is an unstable equilibrium, and relying too heavily on moral pressure to solve the problem is dangerous. A natural question arises: would it make sense to encapsulate some kind of functionality in the protocol to make liquidity staking less centralized?
The key question here is: what kind of in-protocol functionality? The problem with simply creating an in-protocol fungible "staked ETH" token is that it would either have to enshrine Ethereum-wide governance to choose who runs the nodes, or be open-entry, which would turn it into a vehicle for attackers.
One interesting idea is Dankrad Feist's article on liquidity staking maximalism. First, we bite the bullet and accept that if Ethereum were 51% attacked, perhaps only 5% of the attacking ETH would be slashed. This is a reasonable tradeoff: with over 26 million ETH currently staked, the cost of attacking a third of it (~8 million ETH) would be more than excessive, especially considering how many kinds of "outside-the-model" attacks can be pulled off for less. In fact, similar tradeoffs have been explored in the "super-committee" proposal for implementing single-slot finality.
If we accept that only 5% of attacking ETH is slashed, then over 90% of staked ETH would be invulnerable to slashing, and could therefore serve as a fungible in-protocol liquidity staking token that other applications can then use.
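The numbers here are simple arithmetic on the figures quoted above (26M ETH staked, an attacker needing roughly a third, only ~5% of attacking ETH slashed); nothing below is protocol data, and the per-validator reading is an illustration:

```python
total_staked = 26_000_000
attacking_stake = total_staked / 3        # stake needed for the attack
slashed = 0.05 * attacking_stake          # ETH actually destroyed

print(round(attacking_stake))   # 8666667 (~8-9M ETH committed)
print(round(slashed))           # 433333 ETH at risk of slashing

# Per validator: if at most ~5% of a 32 ETH stake is ever slashable,
# roughly 30 ETH of it can never be destroyed, which is what makes a
# fungible in-protocol staking token plausible.
print(round(32 * 0.95, 1))      # 30.4
```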
This path is interesting. But it still leaves the question: what exactly would be wrapped? Rocket Pool already operates very much this way: each node operator provides some of the funds, and liquidity stakers provide the rest. We could simply adjust a few constants, capping the maximum slashing penalty at, say, 2 ETH, and Rocket Pool's existing rETH would become risk-free.
There are other clever things we can do with simple protocol tweaks. For example, imagine we want a system with two "layers" of staking: node operators (with high collateral requirements) and depositors (with no minimum and able to join and leave at any time), but we still want to guard against node operator centralization by giving powers to a randomly sampled committee of depositors, such as suggesting lists of transactions that must be included (for censorship-resistance reasons), controlling the fork choice during an inactivity leak, or needing to sign off on blocks. This could be done in a largely out-of-protocol way by tweaking the protocol to require each validator to provide (i) a regular staking key, and (ii) an ETH address that can be called between each slot to output a secondary staking key. The protocol would grant powers to both keys, but the mechanism for choosing the second key in each slot could be left to staking pool protocols. It may still be better to encapsulate some pieces directly, but it is worth noting that this "include some things, leave others to the user" design space exists.
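Here is a hypothetical sketch (invented structures, not a spec) of that two-key design: the protocol enshrines only the requirement that each validator expose a primary key plus a source for a per-slot secondary key, while the selection logic stays with the staking pool:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Validator:
    staking_pubkey: bytes                          # regular staking key
    secondary_key_source: Callable[[int], bytes]   # a contract address in practice

def keys_for_slot(v: Validator, slot: int) -> tuple[bytes, bytes]:
    # The protocol grants powers to both keys; HOW the secondary key is
    # chosen (e.g., sampling a depositor committee) is out-of-protocol logic.
    return v.staking_pubkey, v.secondary_key_source(slot)

# Toy pool logic: rotate through depositor keys, one per slot.
depositors = [b"dep0", b"dep1", b"dep2"]
v = Validator(b"node_operator_key", lambda slot: depositors[slot % 3])
print(keys_for_slot(v, 7))   # (b'node_operator_key', b'dep1')
```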
Wrapping more precompiles
Precompiles (or "precompiled contracts") are Ethereum contracts that implement complex cryptographic operations, with their logic implemented natively in client code rather than in EVM smart contract code. Precompiles were a compromise made early in Ethereum's development: since the virtual machine's overhead was too great for certain very complex and highly specialized code, a few key operations valuable to important applications could be implemented in native code to make them faster. Today, this basically means a handful of specific hash functions and elliptic curve operations.
There is currently a push to add a precompile for secp256r1, an elliptic curve slightly different from the secp256k1 used for basic Ethereum accounts, which is widely used for improving wallet security because it is well supported by trusted hardware modules. In recent years, the community has also pushed for precompiles for BLS12-377, BW6-761, generalized pairings, and other features.
A counter-argument to these calls for more precompiles is that many precompiles added in the past (e.g., RIPEMD and BLAKE) ended up being used far less than anticipated, and we should learn from that. Rather than adding more precompiles for specific operations, we should perhaps focus on a more modest approach based on ideas like EVM-MAX and the dormant-but-revivable SIMD proposal, which would allow EVM implementations to execute broad classes of code more cheaply. Perhaps even existing little-used precompiles could be removed and replaced with (inevitably less efficient) EVM-code implementations of the same functions. That said, it remains possible that there are specific cryptographic operations valuable enough to accelerate that adding them as precompiles makes sense.
What have we learned from all this?
The desire to encapsulate as little as possible is understandable and good; it stems from the Unix philosophical tradition of creating minimal software that can be easily adapted to the varying needs of users and avoid the curse of software bloat. However, blockchains are not personal computing operating systems, but social systems. This means that it makes sense to encapsulate certain functionality in the protocol.
In many cases, these other examples are similar to what we saw with account abstraction. But we also learned some new lessons:
Encapsulating functionality can help avoid centralization risks in other areas of the stack:
Often, keeping the base protocol minimal and simple pushes complexity to some extra-protocol ecosystem. From a Unix philosophy perspective, this is fine. However, sometimes there is a risk that the extra-protocol ecosystem will become centralized, usually (but not only) because of high fixed costs. Encapsulation can sometimes reduce de facto centralization.
Encapsulating too much can overly expand the trust and governance burden of the protocol:
This is the theme of the earlier post "Don't Overload Ethereum Consensus": if encapsulating a specific functionality weakens the trust model and makes Ethereum as a whole more "subjective," it weakens Ethereum's credible neutrality. In those cases, it is better to build the functionality as a mechanism on top of Ethereum rather than bringing it into Ethereum itself. Here, the encrypted mempool is the best example: it may be a bit too hard to encapsulate, at least until delay encryption improves.
Encapsulating too many things can make the protocol too complex:
Protocol complexity is a systemic risk, and adding too many features to a protocol increases this risk. Precompiles are the best example.
In the long run, encapsulating functionality can be counterproductive because user needs are unpredictable:
A feature that many people think is important and will be used by many users may not be used very often in practice.
Additionally, the cases of liquidity staking, ZK-EVMs, and precompiles show the possibility of a middle path: minimal viable enshrinement. Rather than encapsulating an entire piece of functionality, the protocol can include specific pieces that solve its key challenges, making the functionality easy to implement without being too opinionated or too narrow. Examples include:
Rather than encapsulating a complete liquidity staking system, change the staking penalty rules to make trustless liquidity staking more viable;
Rather than wrapping more precompiles, wrap EVM-MAX and/or SIMD to make a wider class of operations cheaper to implement;
Rather than encapsulating the entire concept of rollups, simply encapsulate EVM verification.
We can expand the previous diagram as follows:
Sometimes it makes sense to unwrap something, and removing rarely used precompiles is one example. Account abstraction as a whole, as mentioned earlier, is also an important form of unwrapping. If we want to support backwards compatibility for existing users, the mechanism may actually be surprisingly similar to the one for unwrapping precompiles: one proposal is EIP-5003, which would allow EOAs to convert their accounts in-place into contracts with the same (or better) functionality.
The trade-off between which features should be brought into the protocol and which should be left to other layers of the ecosystem is a complex one that will hopefully continue to improve over time as our understanding of user needs and the suite of available ideas and technologies improves.