clientdiversity.org execution diversity data

tl;dr:
clientdiversity.org has removed ethernodes data and now only shows execution-diversity.info data. This is because the two datasets are not comparable and don’t answer the same question.
ethernodes scrapes the network and answers “how many peers might run each client?”, which gives a reasonably accurate count of the number of nodes but says nothing about the number of validators on those nodes. For Ethereum, what matters is that no client has a supermajority of validators. Whether a client has a supermajority of individual nodes is not a concern for the health of the network.
execution-diversity.info better answers the question “What is the state of client diversity on the Ethereum network?”, which is what matters for safety from a supermajority bug.

Note: ethernodes and execution-diversity.info are both Bitfly-affiliated projects. Thank you to Butta for letting me interrogate him about both.

Why is the new data on clientdiversity.org so different?

The data published on clientdiversity.org was recently updated. The data shown before came from ethernodes.org and indicated that execution layer client diversity had improved, with the leading client at only ~48% of the network. This seemed to be a huge improvement over the ~70-80% range we had been seeing in previous months. However, the data source was changed to execution-diversity.info, and what’s displayed now shows a supermajority client at ~84% of the network. So why the change? Was the improvement we witnessed a phantom? Which dataset is correct?

The left side of the image is ethernodes data and the right is execution-diversity.info data

How execution-diversity.info data is obtained

The execution-diversity.info methodology is simple to understand: large operators were asked which clients they run. It’s assumed that these operators are honest and have no incentive to misrepresent their setups. This manual survey covers only a portion of the network (an estimated 57%). The surveyed setups are likely similar to those of other large operators that aren’t covered. They may not be representative of home stakers, who likely represent a majority of nodes but a very small minority of validators.

It’s important to note that the number we’re after is client diversity across validators. So while execution-diversity.info may underrepresent home stakers, they account for so few validators that this shouldn’t have a very large impact on the overall accuracy of the data.
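As a rough illustration of why underrepresenting home stakers moves a validator-weighted figure so little, here is a minimal sketch. The 5% home-staker weight is a hypothetical placeholder, not a measured value, and treating ~84% and ~48% as the leading client's share among large operators and home stakers respectively is an assumption made only for this example.

```python
def blended_share(pro_share: float, home_share: float, home_weight: float) -> float:
    """Validator-weighted share of one client across two groups of operators.

    pro_share:   the client's share among large operators' validators
    home_share:  the client's share among home stakers' validators
    home_weight: fraction of all validators run by home stakers
    """
    if not 0.0 <= home_weight <= 1.0:
        raise ValueError("home_weight must be between 0 and 1")
    return (1.0 - home_weight) * pro_share + home_weight * home_share


# Hypothetical inputs: home stakers run 5% of validators (an assumption).
survey_only = 0.84
with_home_stakers = blended_share(pro_share=0.84, home_share=0.48, home_weight=0.05)

print(f"survey only:       {survey_only:.1%}")
print(f"with home stakers: {with_home_stakers:.1%}")
```

Under these assumed inputs, including home stakers shifts the figure by less than two percentage points, because their weight in the validator set is small.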

How ethernodes data is obtained

Ethernodes uses a node scraper that connects to a node that has a free peer slot, collects some data, and then disconnects. This means that a node with no free peer slots is invisible to the scraper. A number of things can influence whether a node has free slots, and those factors might be more or less relevant to certain clients, which would influence how ‘visible’ they are to the scraper. For example, if a client’s default peer count is set very high, it has more available slots and may therefore be more likely to have a free one for the scraper to connect to. In this case, that client would be overrepresented in the data.
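The visibility effect can be sketched with a small calculation. This is not how ethernodes works internally; the client names, node counts, and per-client probabilities of having a free peer slot are all hypothetical values chosen to show the direction of the bias.

```python
def observed_shares(true_nodes: dict[str, int], visibility: dict[str, float]) -> dict[str, float]:
    """Share of each client as a scraper would see it.

    true_nodes: actual number of nodes per client
    visibility: assumed probability that a node of that client has a free
                peer slot when the scraper tries to connect
    """
    seen = {client: count * visibility[client] for client, count in true_nodes.items()}
    total = sum(seen.values())
    return {client: value / total for client, value in seen.items()}


# Hypothetical network: client_a really runs on 80% of nodes,
# but its nodes are rarely reachable by the scraper.
true_nodes = {"client_a": 800, "client_b": 200}
visibility = {"client_a": 0.25, "client_b": 0.75}

for client, share in observed_shares(true_nodes, visibility).items():
    print(f"{client}: {share:.1%}")
```

With these made-up inputs, a client running on 80% of nodes appears at roughly 57% in the scraped data, purely because its nodes are harder to reach.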

Why this makes these datasets not comparable

As noted in a paper published in 2022, available peer slots are a scarce resource, so the ethernodes data, like the execution-diversity.info data, is only representative of a portion of the network. Which portion is unclear.

Are professional operations less likely to have available peer slots and thus be invisible to a scraper? That would mean that home operators would be overrepresented in the data. Home operators have a lower barrier for switching clients (less complicated setups, answering to no one for some missed attestations) and are likely to be engaged with the community and conscious of the need for client diversity, so an overrepresentation of home stakers would make client diversity look healthier than its actual state.

Maybe the node scraper gets an abundance of data when a setup is coming online from zero. Are home operators more likely to be these setups? A professional operator will add or subtract validators from an existing setup and run thousands of keys on a small number of nodes, whereas a home operator may be bringing a node online for a single validator. Home operators are also more likely to do something silly, like letting their hard drive fill up and then needing to resync the database.

So while execution-diversity.info may underrepresent home stakers, ethernodes may overrepresent home stakers, who account for a much smaller portion of the overall number of validators.

Ethernodes data doesn’t answer the question “what is the client diversity of the overall network” but rather “what clients have the most free peers”, with client diversity being only one variable in this equation.
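The gap between counting nodes and counting validators can be made concrete with a toy example. Every number below is hypothetical (including the assumption of 1,000 validators per professional node and one per home node); the point is only that the same network can look balanced by node count and heavily concentrated by validator count.

```python
from collections import Counter

# (client, validators per node, number of nodes) -- all values hypothetical
fleet = [
    ("client_a", 1000, 8),   # professional nodes
    ("client_b", 1000, 2),   # professional nodes
    ("client_a", 1, 40),     # home nodes
    ("client_b", 1, 60),     # home nodes
]

node_counts: Counter[str] = Counter()
validator_counts: Counter[str] = Counter()
for client, validators_per_node, nodes in fleet:
    node_counts[client] += nodes
    validator_counts[client] += validators_per_node * nodes


def to_shares(counts: Counter) -> dict[str, float]:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {client: count / total for client, count in counts.items()}


node_shares = to_shares(node_counts)
validator_shares = to_shares(validator_counts)

for client in sorted(node_shares):
    print(f"{client}: {node_shares[client]:.1%} of nodes, "
          f"{validator_shares[client]:.1%} of validators")
```

In this made-up network, client_a runs on fewer than half of the nodes yet controls close to 80% of validators, which is the number that matters for a supermajority bug.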

And so,

For this reason, clientdiversity.org has removed the ethernodes data in favor of the execution-diversity.info data. Neither is perfect, and the latter will require manual check-ins to stay up to date. But without knowing exactly how ethernodes data correlates with overall diversity, we believe that execution-diversity.info data better highlights the state of execution layer client diversity.