In 2017, a group of Google researchers released a groundbreaking paper called "Attention Is All You Need." This paper signaled the onset of the AI revolution and established the foundation for GPT and almost every model developed since.
The next few pages provide a general overview of how a transformer operates. We will examine the encoder, represented by the left-hand block in the diagram, and the decoder on the right, exploring the function of each component. Following this, we will review the advancements and modifications that have shaped modern models since the original design.
The key concept to grasp is that machine learning models are fundamentally statistical, relying on numerical data to predict outcomes. Essentially, the entire model consists of a complex array of numbers and linear algebra, performing intricate combinations of multiplications and additions to approximate the next word in a sequence. Each component that follows is simply a different element within this broader computational process.
Transformers 101
Encoder Components
Word Embeddings
Transforms input tokens (words) into dense vector representations, capturing semantic information. This is the pink input embedding part of the diagram. As you can see, the idea is to learn vector representations of words relative to their meaning.
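To make this concrete, here is a minimal sketch (using PyTorch; the vocabulary size, embedding dimension, and token IDs are illustrative, not from the original paper) of how an embedding layer maps token IDs to dense vectors:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 10,000-word vocabulary and 512-dimensional embeddings
vocab_size, d_model = 10_000, 512
embedding = nn.Embedding(vocab_size, d_model)

# Three made-up token IDs (e.g. "the", "quick", "brown" after tokenization)
token_ids = torch.tensor([[12, 481, 3070]])
vectors = embedding(token_ids)   # shape: (1, 3, 512) -- one dense vector per token
```

The embedding table is learned during training, so words that appear in similar contexts end up with similar vectors.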
Positional Encoding
This mechanism introduces positional information into the word embeddings, allowing the model to recognize the order of tokens within a sequence. As shown by the small wave symbol linked to the plus sign in Figure 1, the approach developed in 2017 employed a series of sine and cosine waves to encode each token’s position. Due to the varying frequencies of these waves, each token’s position could be uniquely represented, as illustrated in the figure below.
Today, positional encoding is also learned by the model rather than being predefined using a fixed schema like before. This enables the model to adaptively learn token positions based on the data, allowing for greater flexibility and accuracy in understanding positional relationships.
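As a rough sketch of the original sinusoidal scheme (sequence length and dimension sizes below are illustrative), each position receives a unique pattern of sine and cosine values at different frequencies:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of fixed sine/cosine position codes."""
    position = torch.arange(seq_len).unsqueeze(1)                # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# The codes are simply added to the word embeddings before the first attention layer:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```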
Self-Attention: The Crux of Transformers
Allows each token in the encoder to attend to all other tokens in the input sequence, capturing dependencies regardless of their distance apart.
Self-attention is built around three learned matrices: a query, a key, and a value.
Word A has a question (query) and asks Word B, “Do you have useful information for me?”
Word B has some knowledge (key) and checks if it matches Word A’s question.
If Word B’s key is relevant to Word A’s query, Word B responds with its information (value).
Self-attention lets the model weigh each word’s relative importance to every other word. Because these comparisons do not depend on processing the words one after another, the computation is highly parallelizable.
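A minimal sketch of scaled dot-product attention, the formula from the 2017 paper, where every token’s query is compared against every token’s key and the values are mixed according to those scores:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, seq_len, d_k) -- produced by separate learned linear projections
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # how relevant each key is to each query
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # attention weights sum to 1 per query
    return weights @ V                               # weighted mix of the values
```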
Multi-Headed Attention
Utilizes multiple attention mechanisms (heads) in parallel to capture different types of relationships and features from the input.
Essentially, each query/key/value head can be responsible for a different type of information.
Word A asks “do you have grammatical information for me?” in one head of attention, and “do you have semantic information for me?” in another.
The idea is that multiple heads allow the model to encode multiple types of information at once.
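A rough sketch of how the heads are formed: the model splits its embedding dimension into several smaller chunks, runs attention independently in each, and concatenates the results (the head count and sizes below are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8                  # 8 heads of size 64 each (illustrative)
head_dim = d_model // n_heads

x = torch.randn(1, 10, d_model)            # (batch, seq_len, d_model)
qkv_proj = nn.Linear(d_model, 3 * d_model) # one projection producing Q, K, and V
out_proj = nn.Linear(d_model, d_model)

q, k, v = qkv_proj(x).chunk(3, dim=-1)
# Reshape so each head attends independently: (batch, n_heads, seq_len, head_dim)
split = lambda t: t.view(1, 10, n_heads, head_dim).transpose(1, 2)
q, k, v = split(q), split(k), split(v)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v    # each head captures a different relationship
out = out.transpose(1, 2).reshape(1, 10, d_model)
out = out_proj(out)                        # recombine the heads into one representation
```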
Add & Normalize (Residual Connection)
Adds the input of the sub-layer to its output (residual connection) and applies layer normalization to stabilize and accelerate training.
The reasoning behind this layer is simple: without normalization, small errors compound as computations stack up. Maintaining 99% accuracy through 100 layers leaves you with only about 36% (0.99^100 ≈ 0.366).
To address this issue, we employ residual connections. These are represented by the arrows in the diagram that bypass the attention head and reconnect afterward. In simple terms, this means we add the original input values back to the output of the attention block, preserving essential information while allowing the model to learn from the changes introduced.
This approach allows the attention block to enhance the original values by adding new information without completely altering them. The residual connection ensures that the core data remains intact, while the model incorporates the additional insights from the attention mechanism.
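A sketch of the pattern, using the post-norm arrangement from the original paper (many modern models normalize before the sub-layer instead; the dimension size is illustrative):

```python
import torch
import torch.nn as nn

d_model = 512                      # illustrative embedding size
norm = nn.LayerNorm(d_model)

def add_and_norm(x, sublayer_output):
    # Keep the original signal (x), let the sub-layer contribute a correction,
    # then normalize so values stay in a stable range layer after layer.
    return norm(x + sublayer_output)

# e.g. x = add_and_norm(x, attention_block(x))
```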
Feed Forward Neural Network
A position-wise fully connected network applied to each token individually, introducing non-linearity and enabling complex transformations.
This is the “thinking” part of the model. By letting the model work in a much higher-dimensional space, it is able to encode complex feature relationships.
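A sketch of the position-wise feed-forward block from the original design: two linear layers with a non-linearity in between, applied to each token independently (sizes follow the 2017 paper, where the inner layer is four times the model width):

```python
import torch.nn as nn

d_model, d_ff = 512, 2048          # model width and wider inner layer

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),      # project each token into a wider space
    nn.ReLU(),                     # non-linearity: this is where the "thinking" happens
    nn.Linear(d_ff, d_model),      # project back down to the model width
)
```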
Add & Normalize (Residual Connection)
Adds the input of the feed-forward network to its output and applies layer normalization, maintaining gradient flow and model stability.
Same idea as the residual connection before.
Decoder Explained
There are only two differences between the encoder and decoder.
Masked Self Attention
The decoder’s job is to make a prediction. The idea behind masked self-attention is to mask the future tokens so that the model only has access to previous words. Encoders are used to encode information; decoders are used to predict the next token.
Example: “The quick brown fox jumps over the lazy dog”.
The word fox should know about “The quick brown” but not about “jumps over the lazy dog” as those are future words.
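A sketch of the causal mask that enforces this: a lower-triangular matrix that blocks each position from attending to anything that comes after it (it can be passed as the mask in the attention sketch above; the token count assumes one token per word, which real tokenizers may not produce):

```python
import torch

seq_len = 9   # "The quick brown fox jumps over the lazy dog" -> 9 words
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = allowed, 0 = future token, blocked
# Row 3 ("fox") has ones only for positions 0-3, so "fox" never sees "jumps over the lazy dog".
```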
GPT is a decoder-only transformer and is what most of us are used to when thinking about generative AI.
Cross Attention
The encoder-decoder architecture is normally used for tasks like translation, such as English to French.
The idea behind cross-attention is that English Word A asks questions about the French sentence, rather than about its own sentence as in self-attention.
This technique is not commonly employed in GPT- or Llama-style models. Instead, our focus will primarily be on models like the one depicted on the right of the diagram, which consist solely of a decoder.
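For completeness, a sketch of how cross-attention differs from self-attention: the queries come from the decoder’s tokens, while the keys and values come from the encoder’s output (sizes and sentence lengths are illustrative):

```python
import torch
import torch.nn as nn

d_model = 512
W_q = nn.Linear(d_model, d_model)   # projects decoder tokens into queries
W_k = nn.Linear(d_model, d_model)   # projects encoder output into keys
W_v = nn.Linear(d_model, d_model)   # projects encoder output into values

decoder_x   = torch.randn(1, 7, d_model)   # target sentence so far (e.g. French tokens)
encoder_out = torch.randn(1, 9, d_model)   # encoded source sentence (e.g. English tokens)

# Queries come from the decoder; keys and values come from the encoder,
# so each target word "asks questions" about the source sentence.
Q, K, V = W_q(decoder_x), W_k(encoder_out), W_v(encoder_out)
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5
out = torch.softmax(scores, dim=-1) @ V    # shape: (1, 7, 512)
```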
Basic Mathematical Understanding
On the right is a 3D plot of a function, f(x, y). The goal of training is to find the parameter values that give the lowest error on this surface, a process known as gradient descent. If we only had two variables, x and y, we would move along the arrow shown to arrive at the trough indicating the lowest value of f. Backpropagation is simply the method for computing which direction is "downhill", expanded from two variables to millions of parameters in a high-dimensional space.
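As a toy sketch of that idea with just two variables (the choice of f, starting point, and learning rate are made up for illustration; real models apply the same update to millions of parameters, with gradients computed by backpropagation):

```python
# Minimize f(x, y) = x**2 + y**2 by repeatedly stepping "downhill" along the gradient.
x, y, lr = 3.0, -2.0, 0.1

for _ in range(100):
    grad_x, grad_y = 2 * x, 2 * y              # partial derivatives of f
    x, y = x - lr * grad_x, y - lr * grad_y    # step against the gradient

print(x, y)   # both approach 0, the bottom of the bowl-shaped surface
```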
Resources for Further Understanding
This has been a brief overview of transformers, providing a general understanding of their functionality. For more in-depth learning, you are encouraged to explore the following resources.
3Blue1Brown: A video series on neural networks and transformers that helps visualize what's actually happening.
Andrej Karpathy: A video series on coding GPT-2 from scratch.
Transformers Illustrated: A good overview of how transformers work.
Backpropagation: A mathematical explanation of backpropagation, which is how the numbers in these layers are determined.