Clustering Farcaster Users By Their Onchain Portfolio

Data deep dive into indexed Farcaster onchain balances

The Minimum Viable Model

In the lead-up to our closed beta, we have been busy adding new functionality to AskGina. In particular, we hope to provide powerful recommendations and insights tailored to the user prompting the bot. This relies on successfully capturing:

  1. Intent: The meaning a user is trying to convey via their prompt.

  2. Context: Who the user prompting our bot is.

This sprint we've developed our first "minimum viable model" for user categorisation. By grouping users based on their onchain holdings, we hope to give more personalised bot responses, on the assumption that users with similar holdings have similar interests.

The Data

The process starts with our daily indexing job, which crawls the verified addresses of the 10,000 most-followed Farcaster users and looks for holdings on Base. We capture not only token contract addresses and balances, but also use a third-party API to retrieve each balance's value in USD, along with other qualitative information about each token. Finally, we calculate each user's total asset value in USD.

The end result is a timeseries of portfolios for the top 10,000 Farcaster accounts by following.

The Curse of Dimensionality

There are 712 different tokens held by the top 10,000 users.

Some tokens, such as Ski Mask Miggles, are held by a single user who owns only 158 USD worth; others, such as Degen and Ethereum, are held by thousands of users in values up to the low seven figures. As a result, 97.6% of our pivoted dataframe consists of zero values, making our design matrix extremely sparse.
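The pivot from a long table of balances to a user-by-token design matrix can be sketched as follows. The column names and sample values here are illustrative, not our actual schema:

```python
import pandas as pd

# Hypothetical long-format balance table: one row per (user, token) holding.
balances = pd.DataFrame({
    "fid":       [1, 1, 2, 3],
    "token":     ["DEGEN", "ETH", "ETH", "MIGGLES"],
    "usd_value": [1200.0, 5400.0, 300.0, 158.0],
})

# Pivot to a user x token design matrix; missing holdings become zero.
X = balances.pivot_table(index="fid", columns="token",
                         values="usd_value", fill_value=0.0)

# Fraction of zero entries, i.e. the sparsity of the design matrix.
sparsity = (X == 0).to_numpy().mean()
print(f"{sparsity:.1%} of entries are zero")
```

On the real data, with 712 tokens and 10,000 users, this fraction is the 97.6% quoted above.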

To effectively group our users, we need to tackle two challenges: high dimensionality and data sparsity. Enter Principal Component Analysis. This technique allows us to reduce the data's complexity while preserving enough information to group users effectively.

Through principal component analysis, we find that:

  • The first component explains 57% of variation in the data

  • The first 10 components explain 80% of variation in the data

  • The first 50 components explain 93% of variation in the data
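
Cumulative explained-variance figures like those above can be read straight off a fitted PCA. A minimal sketch with scikit-learn, using a random sparse stand-in for the real 10,000 × 712 portfolio matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sparse random stand-in for the real portfolio matrix (~5% nonzero).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 50)) * (rng.random((500, 50)) < 0.05)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"First component:     {cumulative[0]:.0%}")
print(f"First 10 components: {cumulative[9]:.0%}")
```

The numbers printed for this toy matrix will differ from the real figures; the point is only where `cumulative` is read.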

We use column-wise shuffled data to create a random benchmark for component variance. Comparing this benchmark to our actual data reveals a critical threshold: the point where real variance dips below random variance. For our dataset, this occurs after just four components, allowing us to focus on these key dimensions for further analysis.
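This shuffled benchmark is a form of parallel analysis. A sketch of how it can be implemented (our production code may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_significant_components(X, seed=0):
    """Count components whose variance exceeds a column-shuffled benchmark."""
    rng = np.random.default_rng(seed)
    real = PCA().fit(X).explained_variance_
    # Shuffle each column independently: this destroys cross-token
    # correlation while preserving each token's marginal distribution.
    X_shuffled = np.column_stack([rng.permutation(col) for col in X.T])
    random = PCA().fit(X_shuffled).explained_variance_
    # Index of the first component that falls below the random benchmark
    # (returns 0 if every real component beats the benchmark).
    return int(np.argmax(real < random))

# Toy data with three latent factors plus noise.
rng = np.random.default_rng(1)
factors = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20))
X_demo = factors + 0.1 * rng.normal(size=(300, 20))
print(n_significant_components(X_demo))
```

On our portfolio matrix this procedure keeps four components.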

Thus we effectively reduce the dimensionality of our data by a factor of 178, from 712 token dimensions to 4 principal components.

The Groups

Finally, we use our own clustering algorithm to assign users to clusters. Because we are clustering centred vectors, the majority of our data lies close to the origin. As such, we initialise most of our clusters near the origin using a skewed multivariate Gaussian distribution.
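As a rough illustration of this initialisation idea, the sketch below seeds k-means (a stand-in for our own algorithm, which the post does not detail) with centroids drawn from independent skew-normal distributions concentrated near the origin. All parameter values here are illustrative:

```python
import numpy as np
from scipy.stats import skewnorm
from sklearn.cluster import KMeans

def origin_skewed_init(n_clusters, n_dims, scale=0.1, skew=4.0, seed=0):
    """Draw initial centroids from a skewed Gaussian concentrated near 0."""
    rng = np.random.default_rng(seed)
    return skewnorm.rvs(a=skew, scale=scale,
                        size=(n_clusters, n_dims), random_state=rng)

# Toy stand-in for the centred 4-component PCA scores.
rng = np.random.default_rng(0)
pc_scores = rng.normal(scale=0.2, size=(1000, 4))

init = origin_skewed_init(n_clusters=20, n_dims=4)
labels = KMeans(n_clusters=20, init=init, n_init=1).fit_predict(pc_scores)
```

Skewing the initial centroids toward the origin means most clusters start where most of the (centred) mass is, rather than being scattered uniformly across the score space.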

Figure: each pairwise combination of the four principal components plotted against each other.

The final output consists of 373 distinct user clusters, each representing a group of users with similar onchain holdings. To visualise this four-dimensional dataset, we've created colour-coded plots. While these plots offer valuable insight, it's important to note the inherent limitations of representing four dimensions visually.

The Closed Beta

If you'd like to use AskGina for yourself, we'll be opening our closed beta via a whitelist. More details on the closed beta and the whitelisting process will be released here, so subscribe to stay tuned for future updates.
