Clustering Analysis on Zeta <> Pyth Transaction Logs


Intro

Previous research looked at transaction failure rates on Solana and identified higher-than-average failure rates for the Pyth pull oracle contracts. Specifically, the signer contract Zeta appears to be failing on many of its price updates, leading to upwards of 20 SOL a week in lost fees from submitting failed transactions.

This post follows up by examining the on-chain sample data (~1000 blocks is the data cap) more closely, using Polars DataFrames and agglomerative clustering to analyze log message patterns.

Data is extracted from Cryptohouse using the ClickHouse Python client. Once extracted, the data is transformed with Polars. Polars allows for complex data transformations and powerful exploratory analysis by structuring data in DataFrames. It excels at handling arrays, unnesting columns, and reusing transformations, and it supports ad-hoc visualizations, all within a single notebook.
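As a rough sketch of this extraction step, the snippet below queries a ClickHouse instance with clickhouse-connect and loads the result into a Polars DataFrame. The host, credentials, table and column names, and the `<ZETA_PROGRAM_ID>` placeholder are all assumptions for illustration, not the notebook's exact query.

```python
import clickhouse_connect
import polars as pl

# Connect to a ClickHouse endpoint (host/credentials are assumptions; the public
# Cryptohouse endpoint may differ from what is shown here).
client = clickhouse_connect.get_client(
    host="crypto.clickhouse.com", username="crypto", secure=True
)

# Illustrative query: recent transactions referencing the Zeta program, with logs.
# Table and column names are assumptions; <ZETA_PROGRAM_ID> is a placeholder.
sql = """
SELECT block_slot, signature, success, log_messages
FROM solana.transactions
WHERE has(account_keys, '<ZETA_PROGRAM_ID>')
ORDER BY block_slot DESC
LIMIT 100000
"""
result = client.query(sql)

# Load the result rows into Polars for transformation.
df = pl.DataFrame(result.result_rows, schema=result.column_names, orient="row")

# Example transformation: explode the log message arrays into one row per message.
logs = df.explode("log_messages")
print(logs.head())
```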

The notebook for all of the code can be found here.

Jito Blocks

Compared to the average 8% transaction failure rate on Solana, transactions that use Jito have much lower failure rates, around 0.8%. The majority of Zeta transactions do not go in Jito bundles, but the two charts below show that, when Jito is used, those transactions experience an overall lower failure rate.

[Charts: Zeta transaction failure rates with and without Jito bundles]

Although using Jito correlates with lower failure rates, it's not clear from the current data whether the relationship is causal. Notably, there are still failure rate spikes in the Jito blocks; these outlier data points could be explored at a later time.
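As a sketch of how this comparison can be computed in Polars, the snippet below assumes a per-transaction frame with hypothetical `is_jito` and `success` boolean columns; the column names and sample values are illustrative.

```python
import polars as pl

# Hypothetical per-transaction frame with a Jito-bundle flag and a success flag;
# the column names and values are assumptions for illustration.
tx = pl.DataFrame(
    {
        "block_slot": [1, 1, 2, 2, 3, 3],
        "is_jito": [True, False, True, False, True, False],
        "success": [True, True, True, False, True, True],
    }
)

# Failure rate per group: share of transactions where success is False.
failure_rates = (
    tx.group_by("is_jito")
    .agg(
        pl.len().alias("tx_count"),
        (1 - pl.col("success").mean()).alias("failure_rate"),
    )
    .sort("is_jito")
)
print(failure_rates)
```

Grouping by block slot as well and plotting the failure rate over time for each group would produce a comparison along the lines of the charts above.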

Log Message Text Clustering

Agglomerative clustering is chosen for this analysis to group similar log messages together because it excels at identifying natural groupings in data without requiring a predefined number of clusters. This is particularly useful when exploring log messages, where the goal is to uncover patterns or anomalies that might not be immediately apparent from a large collection of verbose messages.

Two clusters are generated, a failed log cluster and a successful log cluster, to provide a more succinct overview of the log messages.
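A minimal sketch of this clustering step, assuming TF-IDF features over each transaction's concatenated log messages and scikit-learn's AgglomerativeClustering; the notebook's exact vectorization, distance metric, and linkage may differ, and the sample log text is made up.

```python
import polars as pl
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one concatenated log string per transaction (made-up samples).
log_texts = [
    "Program invoke [1] Instruction: PostUpdate Program success",
    "Program invoke [1] Instruction: PostUpdate Error: price too stale Program failed",
    "Program invoke [1] Instruction: CrankEventQueue Program success",
    "Program invoke [1] custom program error: 0x1771 Program failed",
]

# Vectorize the log text with TF-IDF; vectorizer settings are assumptions.
features = TfidfVectorizer(max_features=2000).fit_transform(log_texts).toarray()

# Agglomerative clustering into two clusters (failed-like vs. successful-like logs).
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(features)

# Attach the cluster labels back to a Polars DataFrame for inspection.
clustered = pl.DataFrame({"log_text": log_texts, "cluster": labels})
print(clustered.group_by("cluster").agg(pl.len().alias("n_messages")))
```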

[Word cloud: failed transaction log cluster]

It's an interesting observation that the successful log cluster tends to contain many more addresses. One hypothesis is that successful transactions end up utilizing more accounts, whereas failed transactions fail earlier in the process and never touch the maximum number of accounts.
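One way to check this hypothesis would be to count base58-looking address tokens per log message in each cluster. The snippet below is a sketch with made-up sample data; the regex, column names, and example address are all assumptions.

```python
import polars as pl

# Hypothetical frame of log text per transaction with its cluster label;
# the address in the sample text is made up for illustration.
clustered = pl.DataFrame(
    {
        "cluster": [0, 1],
        "log_text": [
            "Program 4Nd1mYkV7rFwzYpKqQvDcHxJrTeS9bGkLqWnXoPzUaBC invoke [1] success",
            "Program failed: custom program error 0x1771",
        ],
    }
)

# Count base58-like tokens of address length (32-44 chars) in each log message.
address_pattern = r"[1-9A-HJ-NP-Za-km-z]{32,44}"
address_counts = (
    clustered.with_columns(
        pl.col("log_text").str.count_matches(address_pattern).alias("address_count")
    )
    .group_by("cluster")
    .agg(pl.col("address_count").mean().alias("avg_addresses_per_log"))
)
print(address_counts)
```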

[Word cloud: successful transaction log cluster]

The raw data from these word clouds can be found here:

log messages in failed transactions

[Image: raw log message data for failed transactions]

log messages in successful transactions

[Image: raw log message data for successful transactions]
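For reference, word clouds like the ones above can be generated from each cluster's log text with the `wordcloud` package. This is a minimal sketch with made-up sample text, not the notebook's exact code.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Hypothetical input: all log messages from one cluster joined into a single string.
failed_cluster_text = " ".join(
    [
        "Program failed custom program error",
        "price update too old",
        "insufficient funds for instruction",
    ]
)

# Build and render the word cloud for the failed-transaction cluster.
wc = WordCloud(width=800, height=400, background_color="white").generate(
    failed_cluster_text
)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```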
