In 2023, Web3 data grew by more than 400% year-on-year, as approximated by the growth of Filecoin, one of the leading decentralized data storage solutions. The NFT craze in 2021 arguably kickstarted the “on-chain media” space (consumer media applications with blockchains as their backend), and we are now living through a data revolution. Crypto adoption has broader implications for the media and entertainment industry, where more power is shifting from platforms to users, and this doesn’t yet get as much attention as the financial implications.
My “onchain” Data Science journey
I started doing data science on blockchain data in 2020. It began as a weekend hobby but quickly became an obsession, so much so that I quit my AI job at AWS and started doing it full-time as the first ML hire at Chainlink Labs. I learned much during the 2021-2022 period about blockchain data structure, depth, and potential, especially in the financial markets. Just a few years ago, doing a literature review of AI papers focusing on blockchain data took only a few hours, with maybe a dozen research papers worth looking at. Nowadays there are dedicated conferences for data wizards as well as multiple academic gatherings and publications for blockchain analysis and decentralized AI.
Fast forward to the beginning of 2023: right after a few decentralized social protocols like Lens and Farcaster started getting initial adoption, I realized that a new type of data was being put on the blockchain beyond financial transactions. It felt like the beginning of something bigger for blockchains: serving as a universal, immutable database. The diversity of data being stored went up and, most importantly, it started looking like internet data (with information on consumer web and mobile interactions across different types of media).
As a data practitioner in the traditional web (“web2”), the potential of aggregating so much Web3 data became obvious to me: in the Big Data age it is well known that “data has gravity,” and it is unstoppable in its growth once you “break the data silos”. Once an organization implements a data lake strategy, the next thing you know is that all the teams in that organization start creating 2x, 50x, and 100x more data than before (it is proximity to existing data that makes it easier to generate ideas about new insights that can be extracted, kinda like questions begging new questions). Even with this intuition in mind, I find myself surprised by what has been happening under the application and smart contract layers. For the past year, I have been building custom AI models for Web3 with a few data scientist friends using this new data. We now feel compelled to share more about these exciting trends.
Decentralized Storage Datasets
As of February 2024, there are 1.8674 EiB stored on Filecoin and another 155 TB on Arweave, the two leading decentralized storage solutions. The growth of this data is astonishing: more data is uploaded daily to Filecoin (5.8 PiB daily in June 2023) and Arweave (250 TB daily on average) than is uploaded to Facebook in a day. That is also more than what is uploaded daily to Twitter, Instagram, and TikTok combined! *
* Daily upload volumes for these platforms are estimated from the following assumptions:
10 million TikTok videos of 15-30 seconds are uploaded each day, at roughly 8-10 MB each.
95 million photos and videos are shared on Instagram each day, roughly 80% images at 2-3 MB and 20% videos at 8-10 MB.
500 million tweets are sent every day on Twitter/X, of which roughly 20% contain an image (2-3 MB), 5% contain a video (8-10 MB), and 75% are text only (280 bytes).
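The assumptions above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the estimate does not specify, and the names `ASSUMPTIONS`, `daily_range_tb`, and `estimate_daily_uploads` are made up for this example.

```python
MB = 10**6   # decimal megabyte (assumption: the estimate does not say which unit system)
TB = 10**12  # decimal terabyte
PIB = 2**50  # binary pebibyte, as used for the Filecoin figure

# platform -> (items per day, [(share of items, low size in MB, high size in MB), ...])
ASSUMPTIONS = {
    "TikTok": (10_000_000, [(1.00, 8, 10)]),
    "Instagram": (95_000_000, [(0.80, 2, 3), (0.20, 8, 10)]),
    "Twitter/X": (500_000_000, [(0.20, 2, 3), (0.05, 8, 10),
                                (0.75, 280 / MB, 280 / MB)]),  # 280-byte text tweets
}

# Filecoin daily upload figure quoted in the article (June 2023), converted to TB
FILECOIN_DAILY_TB = 5.8 * PIB / TB


def daily_range_tb(items, mix):
    """Return the (low, high) daily volume in TB for one platform."""
    low_mb = sum(share * lo for share, lo, _ in mix) * items
    high_mb = sum(share * hi for share, _, hi in mix) * items
    return low_mb * MB / TB, high_mb * MB / TB


def estimate_daily_uploads():
    """Return per-platform (low, high) ranges and the combined (low, high) total, in TB."""
    per_platform = {name: daily_range_tb(items, mix)
                    for name, (items, mix) in ASSUMPTIONS.items()}
    total = (sum(lo for lo, _ in per_platform.values()),
             sum(hi for _, hi in per_platform.values()))
    return per_platform, total


if __name__ == "__main__":
    per_platform, total = estimate_daily_uploads()
    for name, (lo, hi) in per_platform.items():
        print(f"{name}: {lo:,.0f} to {hi:,.0f} TB/day")
    print(f"Combined: {total[0]:,.0f} to {total[1]:,.0f} TB/day")
    print(f"Filecoin (quoted): {FILECOIN_DAILY_TB:,.0f} TB/day")
```

Under these assumptions the three platforms together come to roughly 0.8 to 1.1 PB per day, against the 5.8 PiB per day quoted for Filecoin.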
Filecoin vs Arweave data composition
It is important to understand what type of data has been uploaded in detail to these decentralized storage platforms, as that tells a story about the different Web3 use cases developers or “archivers” had in mind.
Filecoin, which leverages the composability of the IPFS system, has seen wider adoption across a large range of institutions by running programs like Filecoin Plus, an incentive program to attract large internet datasets. Only a portion of that data is Web3 / crypto related, focused on blockchain archiving and NFT.storage (500 terabytes in total). The rest is a mix of large research datasets in life sciences, healthcare, environment, and internet data (see composition here and Messari report here).
Arweave, on the other hand, with an innovative fee structure (pay once for permanent storage) and better scaling mechanisms pioneered by the likes of Irys (previously Bundlr), has seen more adoption among “DApp” (decentralized application) developers. In particular, it became the storage location for “onchain media”, supporting developers building new consumer applications powered by Web3 ownership (login with your wallet, collecting posts, articles, podcasts, etc.). We estimate that 100 TB (~65% of Arweave) is related to “Web3 Social”.
A great example of a decentralized social protocol that adopted the blockchain as an immutable database to guarantee “switching powers” to its users is Lens Protocol. Its team pushed the envelope of onchain data scaling by building their own L3 on Arweave, called Momoka.
Data from consumer crypto applications stored on Arweave
Multi-modality social datasets
Over the last 2 years, we’ve seen the emergence of more and more applications that leverage the “onchain history” that users constructed by collecting NFTs into their wallets. This is becoming a powerful trend for two reasons:
The items that a crypto user can collect are not limited to basic profile NFTs (“PFPs”) anymore but can now span any media type imaginable. An NFT nowadays can be a social media post on Lens, a podcast on Pods, a song on Sound, a video on Odysee, or a ticket to a real event like a POAP.
This onchain history is also linked to other crypto users, effectively creating large “social graphs” without silos. These graphs grow as more and more applications get built on top of crypto rails, either directly, as with Lens (where a follow leaves a trace because it interacts with a smart contract) or Ethereum Follow Protocol, or indirectly, as with Warpcast (where every user gets a new wallet behind the scenes).
This trend has been underpinning the growth in data on Arweave, where at the peak of the cycle in 2022, 7.2 TB of NFTs and 1.25 TB of video were uploaded per month.
Archiving of Web2 Social protocols
It is worth noting that incentives exist for different actors to start bringing web2 data into these storage solutions for archiving purposes. We have already seen more than 1 TB of Web2 data from Weibo, Reddit, Twitter, Nostr, YouTube, and TikTok being uploaded to Arweave.
Data volumes started picking up again
Putting the data science hat back on and following the data, we can see an uptick in video uploads on Arweave this January (👀). And with protocols like Lens opening up to any user, we expect the social media posts portion to grow even more, with a focus on game and event streaming.
It is amazing to see that Filecoin and Arweave have amassed an open dataset the size of Wikipedia's media dataset in only a few years, with strong guarantees around its preservation! If the amount of data grows 4X over the next 2 years, which is on the conservative side given the underlying developments, we should see a 500 TB Web3 social dataset on Arweave specifically, eclipsing Wikipedia by a wide margin. That is enough to train a model like ChatGPT on text alone (or one 10X bigger if using all modalities; see how much data is needed to train ChatGPT).
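To see why 4X over two years sits on the conservative side, it helps to compare compound growth rates. The short sketch below is illustrative only: the function names are invented for this example, and the 400% figure is the Filecoin year-on-year growth quoted at the top of this article, used here purely as a point of comparison.

```python
def annual_factor(total_factor, years):
    """Yearly multiplier that compounds to total_factor over the given number of years."""
    return total_factor ** (1 / years)


def total_factor(yearly_factor, years):
    """Overall multiplier after compounding yearly_factor for the given number of years."""
    return yearly_factor ** years


def implied_baseline_tb(target_tb, factor):
    """Starting size (TB) that reaches target_tb after growing by factor."""
    return target_tb / factor


if __name__ == "__main__":
    # 4X over 2 years is a doubling each year, i.e. 100% year-on-year growth
    print("Yearly multiplier for 4X over 2 years:", annual_factor(4, 2))
    # 400% year-on-year growth means a 5X multiplier per year
    print("Two-year multiplier at 400% YoY:", total_factor(5, 2))
    # Starting size implied by a 500 TB target at 4X growth
    print("Implied baseline (TB):", implied_baseline_tb(500, 4))
```

In other words, the 4X scenario assumes a yearly doubling, well below the pace of the Filecoin growth quoted above, and a 500 TB target corresponds to a starting point of about 125 TB.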
The future for Web3 Social Data
Long-form and short-form video formats
Zooming out, it seems we are in the early innings of a mega trend where (1) smart contract innovation drives data growth in ecosystems like Arweave, and (2) one new cultural phenomenon (NFTs) has provided a way to tokenize media and has thereby driven the proliferation of new applications.
Following in the footsteps of Web2, I believe the next wave will be driven by video-enabled apps (decentralized “YouTubes” and “TikToks”), with many category leaders already emerging, like Odysee, which boasts some 5 million monthly active users despite all the headwinds faced by LBRY, the underlying blockchain supporting it.
In fact, many famous YouTubers with millions of followers, foreseeing the censorship risks of closed-off platforms, have started to actively build their audiences on Odysee as a hedge, in some cases already reaching five- to six-figure follower counts.
AI for content moderation, personalization, and generation
As the size of Web3 data grows, some unique challenges and opportunities arise, especially when viewed from a data scientist's perspective.
Content moderation
First, the monetary aspects of crypto do attract bad actors, spammers, and low-quality users (“airdrop farmers”), all of which can dilute the value of the data created. Fortunately, AI techniques such as network and semantic analysis are showing good results in filtering out bot-generated content, for example. One caveat is that you first need to collect a good enough ground-truth dataset.
At mbd, we have been running LLM-like models fine-tuned on 100 million X/Twitter tweets to analyze the Farcaster ecosystem. One lesson so far is that Web2 AI models need to be adapted to Web3, as the cultural norms are different.
Content Personalization
Second, as the data size grows, discoverability becomes a problem, especially because decentralized storage solutions are hard to index and were not designed for “personalized read”. Luckily, here again, many recommendation system techniques pioneered by social media companies over the past 20 years make mining this data efficient.
AI can be used to understand this vast lake of data and serve as an assistant in exploring it. This has the potential to surface many of the great conversations happening in Web3 that the majority of people don't know about.
Our early results leveraging the wider crypto associations between users as they collect NFTs, posts, and articles are promising. Custom models fine-tuned on Farcaster data can predict what people will like, share, or reply to with high accuracy, making algorithmic feeds that generate engagement a reality for Web3 developers for the first time.
The challenge here is to avoid the pitfalls of using machine learning blindly, and instead to empower developers and users to explore the “algorithmic feed design space” and offer different discovery mechanisms that align with their values or their community's values.
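To make the idea of using collect associations for discovery more concrete, here is a deliberately tiny sketch of item co-occurrence recommendation, one of the simplest recommender techniques. It is not the model we run at mbd: the wallet names, item ids, and function names below are all hypothetical, and a real system would use far richer signals.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: wallet -> set of collected item ids
COLLECTIONS = {
    "alice.eth": {"post:1", "song:7"},
    "bob.eth": {"post:1", "song:7", "article:3"},
    "carol.eth": {"post:1", "article:3", "podcast:9"},
    "dave.eth": {"song:7"},
}


def build_cooccurrence(collections):
    """Count how often each pair of items is collected by the same wallet."""
    co = defaultdict(Counter)
    for items in collections.values():
        for a, b in combinations(sorted(items), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(wallet, collections, co, k=3):
    """Rank items the wallet has not collected by co-occurrence with what it has."""
    owned = collections.get(wallet, set())
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


if __name__ == "__main__":
    co = build_cooccurrence(COLLECTIONS)
    for wallet in COLLECTIONS:
        print(wallet, "->", recommend(wallet, COLLECTIONS, co))
```

The scoring rule is exactly the kind of choice that belongs in the feed design space: swapping raw counts for a normalized similarity, or weighting collects differently from follows, changes what gets surfaced.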
Content Generation
Last, we live in a data-hungry world where the competitive advantage between AI models does not lie in their size or the amount of compute available to companies (once a certain hurdle is passed) but in the quality and uniqueness of the datasets they are trained on.
Building content generation models tailored to the Web3 audience, as well as complementing more mainstream ones with this data to help brands appeal to a growing audience, is an awesome opportunity (that I am excited to be working on!). This is especially true when you couple AI training with incentive mechanisms to keep solving the long-tail AI problem, concepts that Web3 developers have pioneered and excelled at.
Have feedback?
You can reach me via email at yl@mbd.xyz or on Farcaster at yassinelanda.eth.
About me: I am the founder of mbd, a crypto and AI research lab focused on building a decentralized AI recommendation system for web3. I went through a16z crypto CSS23 and previously worked in data science at Chainlink Labs and AWS.