
Execution Client Diversity

Call to action: Node operators should diversify their execution clients away from a majority Geth setup

Authors: Nixo(a), Thorsten Behrens(a,b), Sam Coffey(a,c)
a. EthStaker, b. CryptoManufaktur, c. POAP

Contact: team@ethstaker.cc

In the current supermajority execution client environment, there is a significant monetary risk present for operators running only Geth.

Introduction

Ethereum is built to be a robust and resilient network with 100% uptime. In order to achieve this, any central point of failure is identified, addressed, researched, and mitigated. Decentralization of these points of failure is a multifaceted problem that requires active and ongoing attention at every layer of the stack, including but not limited to geography, software implementations, hardware in use, block building methods, staking pool protocol dominance, etc.

Ethereum client diversity is a focus of the staking community, especially among those who run nodes. Maximizing client diversity benefits the network as a whole, keeps an operator’s investment safe and robust, and minimizes the potential of a catastrophic event due to an error in a single client, which would impact every entity: individual and commercial operators alike.

This document summarizes the current state of execution clients, especially at scale, and argues that two minority execution clients, Nethermind and Besu, are now mature enough to justify a push for diversification in larger operations. Large staking operations would be prudent to spread their validators across a mix of execution clients and away from a Geth majority.

Brief history of consensus client diversity initiatives and effects

EthStaker initiated a concerted effort in late 2021 to inform stakers and larger staking operations of the vulnerability created by the Prysm consensus client supermajority that existed at the time. Prysm was a robust first mover in the consensus client software arena but, by that time, Lighthouse, Teku, and Nimbus were on par with Prysm in performance, and the supermajority represented an urgent threat to the Beacon Chain.

Solo stakers, who are generally very engaged, quickly began switching to minority clients, and many larger staking operations followed, committing to diversification. Data and testimonials from EthStaker community members who reported their experiences with these minority clients helped enormously. Prysm soon lost its supermajority while the presence of minority clients, especially Lighthouse, began to climb. Today we are at a much healthier, though not perfectly ideal, mix of consensus clients. With five viable and robust consensus clients, we would ideally see each at an equal 20% market share.

Data provided by Sigma Prime's Blockprint — updated daily. Visualization by Ether Alpha’s clientdiversity.org

An incident in May of 2023 illuminated the potential vulnerability of unpredictable software bugs. An exceptional scenario caused many nodes to struggle to follow the chain and led to a delay in finality for four epochs on May 11th and for nine epochs on May 12th. The issue was assessed, diagnosed, and fixed by patches to the Prysm and Teku clients the following day. The full post-mortem by Offchain Labs is listed in the references.

During these issues of delayed finality, transactions continued to be processed on the network with no effects felt by the end user. This was possible because Lighthouse had a different design approach and dealt with the situation in a way that Prysm did not (Lighthouse and Prysm currently represent ~75% of the network).

Had Prysm still held a supermajority, the effects of this bug would have been far more severe for both the network and stakers. As it was, stakers experienced an inactivity leak for the first time in Ethereum’s history. This penalty affects non-attesting validators and grows quadratically over time. Validators that continued to attest (Lighthouse users) did not experience this penalty.

The current state of execution client diversity

Execution client diversity data is more difficult to assess than consensus client diversity data. The consensus client utilizes the execution client and publishes blocks, leaving fingerprints in the way that they order attestations, but the execution client doesn’t leave identifiable traces on-chain.

There is currently an effort by execution-diversity.info to gather self-reported data from staking providers about the execution clients that they use. Some protocols (e.g. Lido) conduct quarterly surveys where this data can be found, some (e.g. Rocket Pool) make theirs visible through graffiti, and some have reported privately.

Self-reported data from execution-diversity is summarized by Ether Alpha in private correspondence

With 44.5% of the network self-reporting, Geth comes in at 86.8%, a worrying supermajority. Until now, EthStaker has made an intentional decision to not push for execution client diversity because of the state of the other execution clients - they simply weren’t robust or reliable enough to replace Geth, especially in commercial staking operations.
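Because only part of the network reports, the 86.8% figure is an extrapolation from the reporting subset. As a rough illustration (not part of the survey itself), the sketch below bounds Geth's network-wide share under the two extreme assumptions: that no non-reporting validator runs Geth, or that all of them do. The function name `share_bounds` and the reading of 44.5% as a fraction of validators are assumptions made for this example.

```python
def share_bounds(reporting_fraction, share_among_reporting):
    """Return (lower, upper) bounds on a client's network-wide share.

    lower: none of the non-reporting validators run the client.
    upper: all of the non-reporting validators run the client.
    """
    inputs = (reporting_fraction, share_among_reporting)
    if not all(0.0 <= value <= 1.0 for value in inputs):
        raise ValueError("both inputs must be fractions between 0 and 1")
    known = reporting_fraction * share_among_reporting
    unknown = 1.0 - reporting_fraction
    return known, known + unknown


# Figures quoted in this article: 44.5% reporting, 86.8% of those on Geth.
low, high = share_bounds(0.445, 0.868)
print(f"Geth share is between {low:.1%} and {high:.1%}")
```

Under these assumptions, even the most optimistic bound (roughly 39%) is above one third of the network.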

Geth is a safe execution client to run until either a bug appears in a Geth release or an exceptional scenario arises on the execution layer, similar to the one that affected consensus clients in May of 2023. The likelihood of this is very small, but with Geth at an estimated 86.8% share of the network, such a scenario would be catastrophic.

Recommended execution clients for use at scale

GitHub: Geth, Nethermind, Besu
Docs: Geth, Nethermind, Besu

Other execution clients

GitHub: Erigon, Reth, nimbus-eth1
Docs: Erigon, Reth, nimbus-eth1

These recommendations are based on extensive testing on mainnet and testnets, both within the community and as reported by CryptoManufaktur and Ethpool.

CryptoManufaktur is a commercial node operator running thousands of keys for various entities. It has been testing minority clients at scale for robustness and reliability in an effort to identify an appropriate time to push for execution client diversity, and has shared its findings with EthStaker. Ethpool by bitfly was known for running the largest Ethereum mining pool, Ethermine. Since genesis, they relied on Geth and Parity clients. However, after extensive testing and the recent shift of block building to relays, Ethpool has switched to running Nethermind for their thousands of keys.

EthStaker is confident recommending minority clients Nethermind and Besu for at-scale operation. While all three minority clients (Nethermind, Besu, Erigon) are being used in production with at-scale validation, Nethermind and Besu currently present the lowest risk.

Individual notes on currently available clients

Geth

The main advantage of running Geth is that it is reliable and battle-tested and has been shown to be robust under normal circumstances. It prunes offline and is working toward path-based state storage (PBSS), which will remove the need to prune entirely.

At its current supermajority status, the use of Geth presents a risk to both Ethereum as a whole and to the individual operator running it. This risk is discussed in the next section.

Nethermind

A recent Nethermind release (v1.8) allows an operator to start attesting to blocks in ~1 hour, with old bodies and receipts downloaded, and attestation performance improved, within about 4 hours. Nethermind can prune online and is also working toward PBSS.

Nodes running Nethermind previously saw an issue where the client would resource-starve a machine during pruning. This has been fixed in recent Nethermind releases, and nodes running Nethermind no longer experience it.

Besu

After the merge, there were reports that Besu was slow in processing blocks, causing missed attestations. This issue has been fixed in recent releases. A current potential point of concern is that when Besu runs out of disk space, it needs to be resynced, which means downtime for that CL-EL pair while the resync completes.

Erigon

Erigon is stable and robust with efficient syncing but is not yet recommended for commercial stakers or larger operations.

Erigon has a history of changing command line parameters in a breaking manner without warning. Additionally, when Erigon moves to their V3 and V4, the versions will require operators using their software to complete full resyncs from scratch. These are the primary reasons for not recommending their use in larger operations at this time.

The risk presented by an execution client supermajority

How would this failure occur?

The primary focus of this memo is the risk associated with Geth, specifically the vulnerability it introduces through a supermajority. Should an extraordinary situation occur in a supermajority environment where validators using Geth are left isolated and operating on a separate chain, there is a significant monetary risk to the node operator running Geth.

In a blog post in February of 2022, developer Jonathan Cook (jmcook.eth) addressed the risk of poor execution client diversity in the context of consensus client diversity:

"a bug affecting the execution clients can also propagate through to the consensus layer since, after the merge, the two will be coupled together with the execution payload generated by the execution clients being a core component of Beacon Blocks."

There are two significant numbers to look for here when evaluating the risk of a client’s dominance to the chain: 33% and 66%.*

33%

A failure affecting 33% of the network can prevent the chain from finalizing. This is not catastrophic, but it does present a potential monetary loss for validators unable to attest correctly. This is the scenario that occurred on the consensus layer in May 2023, when the affected clients represented more than 33% of the network.

In order to encourage adoption of the multi-client architecture that Ethereum utilizes to maximize its robustness, penalties are dynamic and are highest for highly correlated errors. If one validator goes offline, penalties are minor. In the case of a non-finalizing chain, where between 33% and 66% of the network is unable to attest and Ethereum fails to finalize for some number of epochs, the chain enters an “inactivity leak” mode. In this mode, non-attesting validators are penalized by an amount that grows each epoch, so their cumulative losses increase roughly quadratically with time. This continues until they have leaked enough for the attesting validators to finalize the network, or until client developers are able to diagnose the issue and release a patch for the affected clients.

This penalty is 1% of the validator’s balance at ~4.5 days, 5% at ~10 days, and 20% after three weeks, as outlined in a March 2022 blog post by Dankrad Feist, a researcher at the Ethereum Foundation.
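The figures above can be reproduced approximately with a closed-form sketch. It assumes that an offline validator's per-epoch penalty grows linearly with the number of epochs spent in the leak, so the remaining balance decays as exp(-n^2 / (2 * quotient)) after n epochs. The default `quotient` of 3 * 2**24 is an assumption: it is the Altair-era inactivity penalty quotient, which appears to match the figures quoted from Dankrad's post. Later forks changed this parameter, and the real protocol works on effective balances in discrete steps, so treat the output as illustrative only.

```python
import math

SECONDS_PER_EPOCH = 32 * 12                    # 32 slots of 12 seconds each
EPOCHS_PER_DAY = 24 * 60 * 60 / SECONDS_PER_EPOCH


def leaked_fraction(days_offline, quotient=3 * 2**24):
    """Approximate fraction of balance lost to the inactivity leak.

    Continuous approximation: the per-epoch penalty is balance * n / quotient
    at epoch n, which integrates to an exponential in n squared.
    `quotient` is an assumed protocol parameter (see the text above).
    """
    if days_offline < 0:
        raise ValueError("days_offline must be non-negative")
    epochs = days_offline * EPOCHS_PER_DAY
    return 1.0 - math.exp(-(epochs ** 2) / (2 * quotient))


for days in (4.5, 10, 21):
    print(f"{days:>4} days offline: about {leaked_fraction(days):.1%} leaked")
```

With the assumed quotient, the three durations land close to the 1%, 5%, and 20% figures quoted above.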

66%

There are three types of bugs to consider for a failure by a client that is being run on >66% of validators on the network, outlined in Dankrad’s above-mentioned blog post:

  • a bug that creates a double-signing event

  • a bug that causes validators running that client to be offline

  • a bug where validators running that client produce an invalid block

With the first (very unlikely) type of bug, all validators running the supermajority client would be slashed and penalized the entire value of their validator’s balance. With the second type of bug, they would experience the inactivity leak until validators running other clients are able to finalize with their increasing portion of the network.

With the third type of bug, validators running the supermajority client would finalize an invalid chain. At this point, the majority of the network would be following this invalid chain. This would throw the network into chaos because it would require an intervention from the social layer to decide whether to recover the valid chain or continue with the invalid chain because the majority of validators followed it. Such a bug is illustrated in an example on the Kintsugi testnet in early 2022, where a missing check in two clients resulted in an invalid block being declared valid and, even after the fix was deployed, an additional related error caused validators on another client to fail to join the valid fork.

This would not only affect validators, but every application built on top of Ethereum that continues to operate during this period. A social layer intervention could choose to slash and severely penalize the affected validators, simply exit them from the network, or choose to follow a chain that isn’t valid. Any available choice here would shake faith in the immutability and reliability of Ethereum.

Geth is currently run by an estimated 86.8% of validators on the network.

*There is an additional disruptive scenario between 50 and 66% of the network, where buggy validators would still build a second chain, and it would be subject to the same inactivity leak. You can read about it in Dankrad’s blog post.
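The thresholds discussed in this section can be summarized in a small sketch that maps a client's share of validators to the worst outcome a bug in that client could cause. The names `Regime` and `classify_share` are hypothetical, the cut-offs use the exact fractions 1/3, 1/2, and 2/3 that the rounded 33%, 50%, and 66% figures stand for, and the handling of shares exactly on a boundary is simplified.

```python
from enum import Enum


class Regime(Enum):
    BELOW_ONE_THIRD = "a failure in this client alone should not stop finality"
    CAN_STALL_FINALITY = "a failure can prevent finalization (inactivity leak)"
    MAJORITY = "a buggy client can build a second chain while finality stalls"
    SUPERMAJORITY = "a buggy client can finalize an invalid chain"


def classify_share(share):
    """Map a client's share of validators (0..1) to its risk regime."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be a fraction between 0 and 1")
    if share > 2 / 3:
        return Regime.SUPERMAJORITY
    if share > 1 / 2:
        return Regime.MAJORITY
    if share > 1 / 3:
        return Regime.CAN_STALL_FINALITY
    return Regime.BELOW_ONE_THIRD


# Geth's estimated share, as quoted in this article.
regime = classify_share(0.868)
print(f"{regime.name}: {regime.value}")
```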

What is the probability of such a bug?

Low. Client developers test their implementations and cross-client compatibility on testnets under a variety of circumstances and, with several clients in active development, it is likely that at least one team will catch a given bug and that the teams will work together to keep it off mainnet. But just as we did not anticipate the non-malicious bug that recently occurred on the consensus layer, there will always be unpredictable bugs caused by exceptional circumstances, and we should be prepared for that scenario. In our current state of execution client diversity, such a bug would require social intervention, would likely cause a monetary loss to a majority of validators, and would be catastrophic for Ethereum’s reputation.

Risks of switching execution clients

For commercial operators: EthStaker recommends avoiding a monoculture client setup and instead running a mix of EL and CL clients. In a mix of clients, any client pair that develops a problem can be taken out of operation without disrupting the rest of the setup.

For small-scale or individual operators: If switching to a new execution client results in a worst-case scenario, the operator will be offline for 1-2 days while they switch to another client. Offline penalties are minor and should not be a major concern for small operators.
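For the commercial case, one way to reason about a client mix is to compute how much of a fleet would be affected if any single client failed. The sketch below is a minimal illustration; the function name `exposure_by_client` and the example fleet (client pairs and validator counts) are hypothetical and not taken from any operator's real setup.

```python
from collections import defaultdict


def exposure_by_client(fleet):
    """Return the fraction of validators that depend on each client.

    `fleet` maps (execution_client, consensus_client) pairs to the number
    of validators running on that pair.
    """
    total = sum(fleet.values())
    if total <= 0:
        raise ValueError("fleet must contain at least one validator")
    counts = defaultdict(int)
    for (execution_client, consensus_client), validators in fleet.items():
        counts[execution_client] += validators
        counts[consensus_client] += validators
    return {client: count / total for client, count in counts.items()}


# Hypothetical fleet of 1,000 validators spread over three client pairs.
example_fleet = {
    ("Geth", "Lighthouse"): 400,
    ("Nethermind", "Teku"): 300,
    ("Besu", "Nimbus"): 300,
}
exposure = exposure_by_client(example_fleet)
for client, fraction in sorted(exposure.items()):
    print(f"{client}: {fraction:.0%} of validators affected if it fails")
```

In this hypothetical mix, no single client failure takes more than 40% of the fleet offline, and the affected pair can be removed while the others keep attesting.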

Conclusions

An execution client supermajority represents an urgent threat to the network and the state of execution clients has now reached a point where at least two minority clients, Nethermind and Besu, can be confidently recommended by EthStaker for use in large staking operations. It is in every operator’s best interest to diversify their execution clients or move to minority clients entirely to minimize their risk.

References:

  1. Client Diversity Initiative. (https://clientdiversity.org/)

  2. Offchain Labs. "Post-Mortem Report: Ethereum Mainnet Finality - 05/11/2023." (https://offchain.medium.com/post-mortem-report-ethereum-mainnet-finality-05-11-2023-95e271dfd8b2)

  3. Edgington, Benjamin. "Inactivity Leaks." Eth2.0 Book. (https://eth2book.info/capella/part2/incentives/inactivity/)

  4. Execution Diversity. (https://execution-diversity.info/)

  5. Cook, JM. "The Importance of Client Diversity in Ethereum." (https://mirror.xyz/jmcook.eth/S7ONEka_0RgtKTZ3-dakPmAHQNPvuj15nh0YGKPFriA)

  6. Feist, Dankrad. "Run the Majority Client at Your Own Peril." (https://dankradfeist.de/ethereum/2022/03/24/run-the-majority-client-at-your-own-peril.html)

  7. Parithosh. "Eth2 Merge Call Notes." (https://notes.ethereum.org/@parithosh/BkkdHWXTY)
