These two pieces provide an overview of the core building blocks needed to understand decentralized training, why decentralized training is interesting, and a spotlight on the three most significant movers in the space.
Decentralized training first popped up on my radar with Gensyn's seed round announcement in March 2022, so it's nice to write this as a reflection on how my thinking about AI development, and consumer appetite for these tools writ large, has evolved over the past two years.
What are Foundation Models?
Each time you query ChatGPT/Claude/Perplexity, you leverage a foundation model that delivers a response via a user interface. Your query is processed (infrastructure routes, authenticates, scales, monitors requests etc) and the system performs inference to generate a response. Foundation models generate responses by using learned patterns and relationships to predict the next item(s) in a sequence—responses are built one token at a time. This process continues until the system determines that a sufficiently complete answer has been generated.
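To make that token-by-token loop concrete, here is a minimal sketch of autoregressive decoding. The `model` and `tokenizer` objects are placeholders with a Hugging Face-style interface; this illustrates the general sampling loop, not any specific provider's implementation.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=100):
    """Illustrative autoregressive decoding: predict one token at a time until done."""
    token_ids = tokenizer.encode(prompt, return_tensors="pt")   # shape: (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(token_ids).logits                        # (1, seq_len, vocab_size)
        next_token_logits = logits[:, -1, :]                    # distribution over the next token
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)  # greedy pick
        token_ids = torch.cat([token_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:         # model signals the answer is complete
            break
    return tokenizer.decode(token_ids[0])
```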
Foundation Model Training Pipeline
Now that you understand the role of foundation models, the inevitable next question is: how are these models built?

Data Collection
The foundation model training process begins with data ingestion: raw data is collected from databases, files, APIs, and other sources across the Internet. This data is then preprocessed to remove duplicates, malformed records, and low-quality content.
Side note: Data is sometimes thought of as the bottleneck for improvements in foundation models, but it's high-quality data that is actually the bottleneck: less noise and more signal better informs the actual value created from these systems, namely human decisions. Dive into synthetic data if you're curious to learn more. Mary spent too much time in college processing and normalizing data for internships.
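As a toy illustration of the "less noise, more signal" point, a preprocessing pass might deduplicate and filter raw documents before they ever reach the model. The heuristics below are hypothetical stand-ins for the far more elaborate pipelines labs actually run.

```python
import hashlib

def clean_corpus(raw_documents, min_words=50):
    """Toy preprocessing pass: drop exact duplicates and very short, low-signal documents."""
    seen_hashes = set()
    cleaned = []
    for doc in raw_documents:
        text = doc.strip()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:           # exact-duplicate filter
            continue
        if len(text.split()) < min_words:   # crude quality heuristic: too short to carry signal
            continue
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned
```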
Architecture Selection
The next step is selecting an architecture that determines how the model will process information and learn patterns from the training data. This stage is intended to identify the most informative attributes for the learning algorithm and allows the inherent structure of different data types (natural language, images, etc.) to be represented by specialized techniques.
You may have heard of the paper Attention Is All You Need. It proposed a deep learning architecture called the Transformer, composed of encoders and decoders and built on attention mechanisms (which let the model focus on dynamic associations between different positions, better capturing long-distance dependencies in a sentence).
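The core of that attention mechanism fits in a few lines. This is the standard scaled dot-product attention from the paper, shown as a standalone sketch (masking and multi-head projections omitted) rather than a full Transformer layer.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Scaled dot-product attention: every position attends to every other position."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # pairwise relevance between positions
    weights = torch.softmax(scores, dim=-1)                  # attention distribution over positions
    return weights @ value                                   # weighted mix of value vectors
```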
Pre-Training
Pre-training is computationally intensive and requires many GPUs running in parallel and meaningful energy resources. The process itself can take weeks or months depending on model size and available computing infrastructure. This scale of computation is considered one of the most significant barriers to foundation model development, and historically only well-resourced organizations have been able to train models from scratch (this stage is the focus of today's article). In this stage, the model learns patterns from a dataset through self-supervised learning: it generates its own learning objectives from the data rather than requiring labeled examples, for example by predicting words in masked text or reconstructing corrupted images.
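The "generates its own learning objectives" part is easiest to see in code. Below is a minimal sketch of one next-token-prediction training step (one common self-supervised objective), assuming a generic `model` that returns logits; the labels are just the input sequence shifted by one position, so no human annotation is needed.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, token_ids):
    """One self-supervised step: the text itself supplies the labels (the next token at each position)."""
    inputs = token_ids[:, :-1]     # all tokens except the last
    targets = token_ids[:, 1:]     # the same sequence shifted left by one
    logits = model(inputs)         # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```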
As more major technology companies dedicate resources to developing foundation model systems, parameter counts continue to grow (more parameters offer greater potential for complex tasks and better outputs). This growth drives significant demand for computation and storage, which in turn puts pressure on hardware resources and computational efficiency. Training these models takes a long time and requires efficient utilization of computational resources.
This growing workload is a difficult challenge to work around, since it creates problems for model serving systems (latency, performance degradation, resource bottlenecks, etc.). Teams are exploring new approaches to both model training and serving; decentralized training is one way they are approaching the scaling of model training.
Fine-Tuning
Pre-training establishes a general understanding of patterns in the data. Fine-tuning adapts foundation models for specific tasks or domains, using smaller domain-specific datasets to refine the model's capabilities for particular applications. This process enables transfer learning, in which knowledge gained during pre-training is applied to new, related tasks.
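Mechanically, fine-tuning reuses the same objective as the pre-training sketch above, but on a much smaller domain dataset and usually at a lower learning rate, starting from the pretrained weights. A rough sketch, again assuming a generic `model` that returns logits:

```python
import torch
import torch.nn.functional as F

def fine_tune(model, domain_dataloader, epochs=3, lr=1e-5):
    """Toy fine-tuning loop: small domain dataset, low learning rate, pretrained weights as the starting point."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids in domain_dataloader:
            inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # same next-token objective as pre-training
            logits = model(inputs)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```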
Implementation
Implementation prepares models for consumer use and involves optimizing the model for deployment (with considerations around latency, throughput, and resource efficiency).
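One common deployment-time optimization is quantization, which trades a little accuracy for lower latency and memory. The snippet below uses PyTorch's dynamic quantization as one example of this class of techniques; it's illustrative, not a full serving setup.

```python
import torch
import torch.nn as nn

def prepare_for_deployment(model):
    """Example deployment optimization: quantize Linear layers to int8 for cheaper inference."""
    model.eval()  # inference mode: dropout off, normalization layers use running statistics
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    return quantized
```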
Interlude on DeepSeek
I’m including this section because I think it provides some pattern recognition and real-world understanding of how different teams are attempting to build better foundation models while making every part of the development pipeline more efficient and, more importantly, cheaper. DeepSeek was able to achieve competitive performance with its foundation models while using substantially fewer compute resources.
Data Collection
The DeepSeek team generated training data that could be automatically verified, focusing on deterministic domains like mathematics where correctness is certain and unambiguous. Prioritizing high-quality, verifiable data set DeepSeek apart from the rest of the pack (the traditional approach is inhaling the expanse of the Internet).
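The appeal of deterministic domains is that a generated answer can be checked programmatically. The sketch below is a hypothetical illustration of that idea, not DeepSeek's actual pipeline: keep a model-generated solution only if its final answer matches the recomputed ground truth.

```python
def filter_verifiable_examples(candidates):
    """Hypothetical verification pass: keep generated math solutions whose final answer checks out."""
    verified = []
    for problem, solution_text, claimed_answer in candidates:
        try:
            # In a deterministic domain the ground truth can be recomputed exactly,
            # e.g. problem = "17 * 24 + 3" for simple arithmetic.
            expected = eval(problem)
        except Exception:
            continue  # skip anything we cannot verify automatically
        if claimed_answer == expected:
            verified.append((problem, solution_text))
    return verified
```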
Architecture Selection
DeepSeek used a mixture-of-experts (MoE) architecture, which contains specialized neural network components that activate selectively depending on the input, in contrast to traditional Transformer architectures that activate all parameters for every input. Only a fraction of the model’s parameters compute for any given input, which significantly reduces compute requirements.
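A toy version of that routing logic: a small gating network scores the experts for each input and only the top-scoring experts run. This is a generic top-k MoE sketch, not DeepSeek's specific architecture.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a gate picks the top-k experts, the rest stay idle."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, dim)
        scores = self.gate(x)                          # (batch, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)       # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                     # loops kept for clarity; real systems vectorize this
            for slot in range(self.top_k):
                expert = self.experts[indices[b, slot].item()]
                out[b] += weights[b, slot] * expert(x[b])
        return out
```

In real systems the routing is vectorized and load-balancing terms keep experts evenly used; the explicit loops here are only to make the selective activation visible.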
Pre-Training
DeepSeek developed a sophisticated reward mechanism that identified the training examples providing the most value to the model’s performance, which allowed them to be extremely selective about how they allocated compute resources. This meant not wasting compute on redundant data that wouldn’t meaningfully improve the model.
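DeepSeek's exact mechanism isn't reproduced here; the sketch below just illustrates the general idea of scoring candidate training examples and spending compute only on the highest-value slice. The `reward_model` and its `score` method are hypothetical stand-ins.

```python
def select_high_value_examples(examples, reward_model, keep_fraction=0.2):
    """Illustrative selection pass: score examples, keep only the top fraction for training."""
    scored = [(reward_model.score(ex), ex) for ex in examples]  # hypothetical scoring call
    scored.sort(key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return [ex for _, ex in scored[:cutoff]]
```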
All of these improvements were born out of necessity, given the restrictions the US has imposed on private companies like NVIDIA supplying compute resources to Chinese companies. Even with these improvements that sidestep raw compute (raw compute being essentially the approach Western tech companies are taking right now), DeepSeek’s founder has acknowledged that they still need more computing power.
Conclusion
This should give you a good understanding of how foundation models are built and a solid start on the next piece, which dives into decentralized training. I'll finish that one up tomorrow; I'm in rural Chile where wifi is sparse (this Internet cafe closes in an hour).
Appendix
A Primer on Compute (Carnegie Endowment)
Attention Is All You Need (Vaswani et al.)
Training and Serving System of Foundation Models (Various Authors)