Farcaster Data as a Public Good

This post is jointly authored with Jacob, based on a research project we are working on.

The future is decentralized, social, and built on your social graphs!
If you have been following the rapid rise of Farcaster, a sufficiently decentralized social network protocol, you might have noticed what is possible when users control their data and that data is open. A rich set of possibilities emerges: you can take your data with you, and others can build on the open data (graph).
You can read an excellent introduction to Farcaster that Kerman Kohli wrote before it was all the rage, and you can read what my research team has written about Farcaster's recent crazy growth.

Given its growth, new players in the ecosystem are interested in leveraging Farcaster and building on it. One way to do that is to replicate the data yourself (permissionlessly, for the cost of hardware/compute) and build on it, because Farcaster uses a hybrid infrastructure in which identity is on chain and data is off chain. The data is stored in hubs, which gives researchers such as myself a rich dataset to analyze and gives builders a base for apps. Some early API providers for building on Farcaster, such as Neynar and Airstack, offer a nice set of tools and charge for them. We maintain a hub for our internal research, and it costs about $120 a month to keep it active on MS Azure.

What if we could put this data on a decentralized storage system, put an interface on it for API access, and gate that interface with a smart contract that charges for access and feeds a treasury contract, which in turn allocates funds to the storage contract? Can we do this autonomously after the initial setup? That is where Jacob and I started our research. We initially looked at Filecoin/IPFS/Bacalhau/Lilypad, and we could likely consider The Graph protocol as well, though it only indexes on-chain data for now.
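To make the money flow concrete, here is a toy Python model of the idea: an access gate charges a fee per query and forwards it to a treasury, which pays the storage bill when it can. This is only a sketch of the accounting, not a smart contract. The class names, the fee of 0.05 per query, and the use of our roughly $120 monthly hub cost as a stand-in for the storage bill are all assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import ROUND_CEILING, Decimal


@dataclass
class Treasury:
    """Toy stand-in for the treasury contract: it only tracks a balance."""
    balance: Decimal = Decimal("0")

    def deposit(self, amount: Decimal) -> None:
        self.balance += amount

    def pay_storage(self, cost: Decimal) -> bool:
        """Pay a storage bill if funds allow; return whether it was paid."""
        if cost > self.balance:
            return False
        self.balance -= cost
        return True


@dataclass
class AccessGate:
    """Toy stand-in for the access contract: charge a fee, then forward it."""
    treasury: Treasury
    fee_per_query: Decimal  # hypothetical price, not a figure from this post

    def charge(self, payment: Decimal) -> bool:
        """Accept a query only if the payment covers the fee."""
        if payment < self.fee_per_query:
            return False
        self.treasury.deposit(payment)
        return True


def queries_to_cover(monthly_cost: Decimal, fee: Decimal) -> int:
    """Smallest number of paid queries whose fees cover the monthly cost."""
    return int((monthly_cost / fee).to_integral_value(rounding=ROUND_CEILING))


treasury = Treasury()
gate = AccessGate(treasury, fee_per_query=Decimal("0.05"))
gate.charge(Decimal("0.05"))  # one paid query lands in the treasury
needed = queries_to_cover(Decimal("120"), Decimal("0.05"))
```

Under these made-up numbers, the break-even point is simply the monthly cost divided by the fee, rounded up. The open research question is whether the same loop can run on chain without a single operator.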

Here is an overview of an initial architecture for doing this. It does not yet have the smart contracts baked in, and it still needs one entity to control the process. The architecture integrates a self-hosted component, data transformation processes, distributed storage, and computational interfaces to manage and analyze Farcaster network data effectively.

Self-Hosted Component:

We operate a Farcaster Hub instance as well as its Replicator module.

- The Hub synchronizes with other Hubs to collect Farcaster messages and updates Farcaster account registrations through Ethereum and Optimism blockchains using RPC endpoints.

- Our Replicator conducts primary ETL (Extract, Transform, Load) operations to populate a PostgreSQL database with the retrieved data, ensuring data integrity and timeliness.

Data Chunking Process:

We execute a secondary ETL process on a scheduled basis, leveraging the Compute Over Data approach through Bacalhau and Lilypad.

- The process queries the PostgreSQL database for recent changes, encapsulating these updates into daily segments and appending them to Parquet files, maintaining an immutable history of modifications.

- This methodology eliminates the need for dedicated infrastructure as computational tasks are containerized and executed in a distributed manner.
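The daily segmentation step described above can be sketched in plain Python. This is a minimal illustration of grouping recently changed rows by UTC day, not our actual job. The `updated_at` column and the file naming scheme are hypothetical, and the Parquet write itself (for example with a library such as pyarrow) is left out.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def segment_by_day(rows: list[dict], since: datetime) -> dict[date, list[dict]]:
    """Group rows changed after `since` into one segment per UTC day.

    Each row is a dict with a timezone-aware `updated_at` datetime
    (a hypothetical column name used only for this sketch).
    """
    segments: dict[date, list[dict]] = defaultdict(list)
    for row in rows:
        changed_at = row["updated_at"].astimezone(timezone.utc)
        if changed_at > since:
            segments[changed_at.date()].append(row)
    return dict(segments)


def segment_filename(day: date) -> str:
    """Hypothetical naming scheme: one file per daily segment."""
    return f"casts-{day.isoformat()}.parquet"


# Example: only rows changed after the last run are kept, grouped by day.
last_run = datetime(2024, 3, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2024, 2, 29, 23, 0, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 3, 1, 8, 0, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 3, 2, 0, 30, tzinfo=timezone.utc)},
]
daily = segment_by_day(rows, last_run)
```

Because each segment only ever receives rows from its own day, earlier files never need to be rewritten, which is what keeps the history immutable.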

Distributed Storage Solution:

Subsequent to the ETL process, we store the processed data on the Filecoin network.

- Data is packaged into Content Addressable aRchives (CAR) and uploaded to IPFS. Storage deals are then negotiated on Filecoin via the Filecoin Virtual Machine (FVM) using the provided Content Identifier (CID).
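The key property behind a CID is content addressing: the identifier is derived from the bytes themselves, so anyone who retrieves the data can check it against the identifier. The sketch below illustrates that idea with a bare SHA-256 digest. It is not a real IPFS CID, which also encodes hash and codec metadata and, for large files, is computed over a chunked structure; the function names are ours.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Illustrative content identifier: the hex SHA-256 digest of the bytes.

    NOT a real IPFS CID; it only demonstrates that the identifier
    depends on nothing but the content.
    """
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_id: str) -> bool:
    """Check that retrieved bytes match the identifier they were stored under."""
    return content_id(data) == expected_id


segment = b"one daily segment of Farcaster data"
cid_like = content_id(segment)

# Identical bytes always map to the same identifier; any change breaks the match.
same = verify(segment, cid_like)
tampered = verify(segment + b"!", cid_like)
```

This is why a storage deal can reference the data by identifier alone: the identifier pins down exactly which bytes are being stored.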

Data Access and Computation Interface:

The information stored on Filecoin is made accessible for queries and computational tasks. We don't have a proposal for a payment layer on this yet. This is where we want to explore using a smart contract that can charge for the analysis/access and route those funds to a treasury that can fund the storage contract on Filecoin.

- SQL Interface: Users can execute SQL queries on the stored data by deploying a Docker container that spins up a DuckDB instance, conducts the query, and outputs the results. This allows users to perform custom SQL queries via Bacalhau/Lilypad services.

- Data Analytics: We also offer specialized Docker containers for predefined analytical operations. Potential applications include generating tailored news feeds based on temporal and thematic criteria and leveraging Language Model integrations for tasks like extracting message vector embeddings.
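As a rough idea of what the SQL interface container could run, here is a minimal sketch that exposes a set of Parquet files as a DuckDB view and executes one query against it. It assumes the third-party `duckdb` Python package is installed in the container. The view name `casts`, the file path, and the function names are hypothetical, and this is not the code we actually deploy.

```python
def view_sql(table: str, parquet_glob: str) -> str:
    """Build SQL that exposes a set of Parquet files as a named DuckDB view."""
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    escaped = parquet_glob.replace("'", "''")  # escape single quotes in the path
    return f"CREATE VIEW {table} AS SELECT * FROM read_parquet('{escaped}')"


def run_query(sql: str, parquet_glob: str, table: str = "casts"):
    """Run one SQL query over the Parquet files; return (column names, rows)."""
    import duckdb  # third-party dependency, assumed present in the container

    con = duckdb.connect()  # in-memory database, discarded when the job ends
    con.execute(view_sql(table, parquet_glob))
    result = con.execute(sql)
    columns = [col[0] for col in result.description]
    return columns, result.fetchall()


# Hypothetical usage inside the container:
#   columns, rows = run_query("SELECT count(*) FROM casts", "/inputs/*.parquet")
```

Keeping the database in memory fits the job model: each query runs in a fresh container, reads the stored files, writes its results, and leaves no state behind.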

This architecture empowers users to harness the full potential of Farcaster data, facilitating a range of analytical tasks without the overhead of managing complex infrastructure.

The other option is to explore The Graph protocol for a solution like this, as it has a well-developed decentralized option for storing and indexing on-chain data, and we have started discussions.

If you have ideas and suggestions, and/or funds to enable this research 😀, please reach out in the comments here, on Warpcast/Twitter @vishalsachdev, or on LinkedIn.
