I’ve spent years learning Spanish, and yet my skills are still rusty. So, I became curious. How long would it take to train a transformer to learn the same task? Spoiler Alert: It does surprisingly well for an overnight training job on minimal hardware (my MacBook Pro). In this post, we’ll dive into the practical aspects of how I built the transformer. A lot of the theory is discussed in the Attention is All You Need paper. You can find my full implementation of the English -> Spanish translation task here.
Data
We use the Helsinki-NLP/opus-100 dataset,
an English-centric multilingual corpus covering 100 languages. For our work, we use the en-es
subset which has English -> Spanish sentence pairs.
Tokenization
Our tokenization strategy uses Byte Level pre-tokenization coupled with a Byte-Pair Encoding
Tokenizer. We’ll explore how this works by looking at how we’d tokenize the string Hello World.
Byte Level pre-tokenization loosely splits up text by whitespace and then adds a visible character
Ġ to the start of each word. This way, when decoding, we know how to recover the spaces. In our
example above, the output of pre-tokenization would be: ["ĠHello", "ĠWorld"].
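Here is a deliberately simplified sketch of that idea in plain Python. It assumes every word (including the first) gets the Ġ prefix and ignores punctuation and non-ASCII bytes, which a real Byte Level pre-tokenizer also handles. The function names are illustrative and not taken from any library.

```python
def pre_tokenize(text: str) -> list[str]:
    """Simplified stand-in for Byte Level pre-tokenization.

    Assumption: split on whitespace only and prefix every word with "Ġ".
    """
    return ["Ġ" + word for word in text.split()]


def restore_spaces(pieces: list[str]) -> str:
    """Undo the marker: each "Ġ" becomes a space, then drop the leading space."""
    return "".join(pieces).replace("Ġ", " ").lstrip()


pieces = pre_tokenize("Hello World")
print(pieces)                  # ['ĠHello', 'ĠWorld']
print(restore_spaces(pieces))  # Hello World
```

Because the marker travels with the word, decoding never has to guess where the spaces were.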
Once the text is split up using Byte Level pre-tokenization, we use the Byte-Pair Encoding Tokenizer to further split things up into tokens.
- First, split each word into individual characters, so that every visible byte character is its own token:
[["Ġ", "H", "e", "l", "l", "o"], ["Ġ", "W", "o", "r", "l", "d"]].
- The tokenizer then looks at its pretrained vocabulary merge list, which is ordered from most common pairs in the training data to least common pairs. Imagine the merge list looks like this:
| Rule | Pair | Merged |
|---|---|---|
| 1 | Ġ + H | ĠH |
| 2 | o + r | or |
| 3 | l + d | ld |
| 4 | or + ld | orld |
The tokenizer then walks the list in order and applies every rule that matches, collapsing each word one merge at a time:
| Word | Apply | Tokens |
|---|---|---|
| Hello | start | ["Ġ", "H", "e", "l", "l", "o"] |
| | rule 1 | ["ĠH", "e", "l", "l", "o"] |
| World | start | ["Ġ", "W", "o", "r", "l", "d"] |
| | rule 2 | ["Ġ", "W", "or", "l", "d"] |
| | rule 3 | ["Ġ", "W", "or", "ld"] |
| | rule 4 | ["Ġ", "W", "orld"] |
- Once no more rules apply, the merging stops. Each element of the resulting arrays is a token, which is mapped to an integer through a fixed vocabulary lookup table.
- For Spanish specifically, we add a [BOS] token to the beginning of the sentence and an [EOS] token to the end of the sentence. This teaches the model where a sentence starts and ends. Critically, when running inference on our model, we know to stop further generation once it outputs an [EOS] token.
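The merge procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tokenizer used in the implementation: the merge list is the imaginary one from the table, the vocabulary ids are made up, and `apply_merges` and `encode` are hypothetical helper names. The [BOS] and [EOS] wrapping is shown on the same "Hello World" example purely for illustration.

```python
# Imaginary merge list from the table above, in priority order.
MERGES = [("Ġ", "H"), ("o", "r"), ("l", "d"), ("or", "ld")]

# Made-up vocabulary ids, for illustration only.
VOCAB = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "Ġ": 3, "ĠH": 4,
         "e": 5, "l": 6, "o": 7, "W": 8, "orld": 9}


def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a pre-tokenized word into characters, then apply each rule in order."""
    tokens = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge the pair wherever it appears side by side.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def encode(words: list[str], add_special: bool = False) -> list[int]:
    """Map every resulting token to its integer id, optionally adding [BOS]/[EOS]."""
    ids = [VOCAB[tok] for word in words for tok in apply_merges(word, MERGES)]
    if add_special:
        ids = [VOCAB["[BOS]"]] + ids + [VOCAB["[EOS]"]]
    return ids


print(apply_merges("ĠHello", MERGES))  # ['ĠH', 'e', 'l', 'l', 'o']
print(apply_merges("ĠWorld", MERGES))  # ['Ġ', 'W', 'orld']
print(encode(["ĠHello", "ĠWorld"], add_special=True))
# [1, 4, 5, 6, 6, 7, 3, 8, 9, 2]
```

A production tokenizer does the same thing far more efficiently and learns both the merge list and the vocabulary from the training corpus.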
Padding
After we apply tokenization to the English and Spanish columns in our dataset, each column is a list of lists of integers. The outer list holds the sentences, and each inner list holds the tokens for one sentence.
The last step of data preparation is to convert this big nested list to a set of tensors. We do so using the following process:
- Select $B$ English -> Spanish sentence pairs. Each set of $B$ pairs is called a batch.
- Convert each batch into an English tensor of dimension $B \times T_{\text{en}}$ and a Spanish tensor of dimension $B \times T_{\text{es}}$, where $T_{\text{en}}$ and $T_{\text{es}}$ are the token counts of the longest English and Spanish sentences in the batch. Pad any shorter sentence to match the longest by adding pad tokens to its end.
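As a rough sketch of the padding step, here is a plain-Python version that works on nested lists. The pad id of 0 and the token ids are assumptions made for the example, and `pad_batch` is a hypothetical helper. In a real pipeline the padded lists would then be handed to a tensor constructor such as `torch.tensor`.

```python
PAD_ID = 0  # assumed id of the pad token


def pad_batch(sequences: list[list[int]], pad_id: int = PAD_ID) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# A batch of two sentence pairs with made-up token ids.
english = [[5, 8, 2], [7, 3, 9, 4, 6]]
spanish = [[1, 12, 2], [1, 15, 11, 2]]

en_batch = pad_batch(english)  # [[5, 8, 2, 0, 0], [7, 3, 9, 4, 6]]
es_batch = pad_batch(spanish)  # [[1, 12, 2, 0], [1, 15, 11, 2]]

# Each side is now rectangular: (batch size) x (longest sentence on that side).
print(len(en_batch), len(en_batch[0]))  # 2 5
print(len(es_batch), len(es_batch[0]))  # 2 4
```

Note that the English and Spanish sides are padded independently, so the two tensors in a batch can have different widths.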
Building the Transformer
We build a full encoder-decoder transformer. Encoder-decoder transformers are good at seq-to-seq tasks, which is a fancy way of saying they can generate one sequence of text based on a different sequence of text. Do you see why this is useful for our translation task? We are taking an English sequence and generating a Spanish sequence.
Transformer Diagram
The full architecture of the encoder-decoder transformer is shown in Figure 1 of the Attention is All You Need paper, with the encoder on the left and the decoder on the right.
An Intuitive Look at the Transformer
The original paper already does a good job of explaining the technical details. So my explanation will only discuss the raw intuition. Here are the key components you need to know:
- The embedding layer converts each tensor generated in the padding section from size $B \times T$ to $B \times T \times d_{\text{model}}$, where $B$ is the batch size, $T$ is the sequence length, and $d_{\text{model}}$ is the embedding size. It does so by generating two embeddings:
  - Token Embedding: Takes the token (represented as an integer) and maps it to a vector of size $d_{\text{model}}$. This vector is supposed to encode the “meaning” of the word by itself.
  - Position Embedding: Takes the position of the token and maps that to a second vector of size $d_{\text{model}}$. This is separate from the Token Embedding, and allows the transformer to account for the positioning of the tokens.

  Both of these embeddings are summed together to form the final embedding.
- Each block contains both attention layers and feed-forward layers. The attention layers are focused on context gathering. Intuitively, this means each token’s vector updates itself by looking at the other tokens in the sequence. For example, the word “plane” means very different things in “Did you see that plane take off?” and “The xy coordinate plane spans $\mathbb{R}^2$.” The attention layer allows the hidden vector for “plane” to capture these distinct meanings. Once the context has been gathered, the feed-forward layer focuses on understanding. It “thinks” about what was gathered.
- The encoder block (left side of the diagram) is focused on general understanding while the decoder block (right side of the diagram) is focused on generation. In our translation task specifically, this means that the English sequence is fed to the encoder (so the transformer can understand it in its entirety), and the Spanish sequence is what the decoder operates on (as the task is to generate Spanish text).

  Technical detail. This is why you see a multi-head attention block in encoders, but a masked multi-head attention block in decoders. The difference is that encoders allow hidden vectors to learn from future hidden vectors during the context gathering phase (as the goal is raw understanding). Decoders only allow hidden vectors to learn from previous hidden vectors because you don’t have access to future hidden vectors when generating text.
- The second attention block in the decoder (the non-masked multi-head attention) is known as cross-attention. It allows the decoder block to learn from what the encoder block “understood” when it generates its own content. This is important because when generating the Spanish sentence, the decoder needs to know what it already generated (the masked multi-head attention) along with the meaning of the English sentence (multi-head attention that connects with the encoder).
- The linear layer at the top of the decoder block converts each hidden vector with dimension $d_{\text{model}}$ to another vector with dimension $V$, where $V$ is the size of our vocabulary (total number of tokens). The value at each position $i$ in this new vector is known as a “logit” and is an unnormalized score for how likely token $i$ is to be the next token. The softmax layer converts these logits into a probability distribution.
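To make the masked versus unmasked distinction concrete, here is a small plain-Python sketch. It assumes the raw attention scores are already given (a real transformer computes them from learned query and key vectors, which this sketch skips), and the function names are illustrative rather than taken from the full implementation. The same `softmax` is also what turns the final logits into a probability distribution.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores (logits) into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores: list[list[float]], causal: bool) -> list[list[float]]:
    """Row i says how much position i attends to each position j.

    With causal=True (decoder-style), position i may only look at j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        if causal:
            # Future positions get -inf, so softmax gives them weight 0.
            row = [s if j <= i else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # three tokens, equal raw scores

encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print(decoder_style[0])  # [1.0, 0.0, 0.0]  first token sees only itself
print(decoder_style[1])  # [0.5, 0.5, 0.0]  second token sees tokens 1 and 2
```

In the encoder-style result every row spreads its weight over all three tokens, while in the decoder-style result the weights above the diagonal are exactly zero.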
Producing Spanish Text
Let’s say we want to generate Spanish for the English sentence “What country do you want to travel to?”. The generation loop works like this:
1. Tokenize the input sentence as described in the tokenization section above.
2. Seed the Spanish generation with a [BOS] token.
3. Feed the English sentence to the encoder block, and feed the Spanish text (currently just the [BOS] token) to the decoder.
4. The transformer outputs a new token, which we append to the Spanish sequence.
5. Repeat from step 3 with the growing Spanish sequence until the transformer outputs an [EOS] token.
6. Convert the integers back to their string tokens by looking at our vocabulary map, and replace the “Ġ” character with a whitespace.
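The loop above can be sketched as follows. To keep the example runnable without a trained model, `toy_next_token` is a hypothetical stand-in that simply replays a canned answer. In the real setup this call would run the encoder and decoder and pick a token from the output distribution. For readability the sketch works with string tokens instead of integer ids, uses simplified tokens (real byte-level tokens for accented characters look different), and adds a `max_len` guard as an assumption so the loop always terminates.

```python
from typing import Callable

BOS, EOS = "[BOS]", "[EOS]"


def greedy_translate(
    source: list[str],
    next_token: Callable[[list[str], list[str]], str],
    max_len: int = 50,
) -> list[str]:
    """Generate target tokens one at a time until [EOS] (or max_len) is reached."""
    target = [BOS]  # seed the Spanish sequence
    while len(target) < max_len:
        # The model sees the full source plus everything generated so far.
        token = next_token(source, target)
        target.append(token)
        if token == EOS:
            break
    return target


def detokenize(tokens: list[str]) -> str:
    """Drop special tokens, join the pieces, and turn "Ġ" back into spaces."""
    text = "".join(t for t in tokens if t not in (BOS, EOS))
    return text.replace("Ġ", " ").strip()


# Hypothetical stand-in for the trained transformer.
CANNED = ["Ġ¿A", "Ġqué", "Ġpaís", "Ġquieres", "Ġviajar", "?"]


def toy_next_token(source: list[str], target: list[str]) -> str:
    step = len(target) - 1  # number of tokens generated so far
    return CANNED[step] if step < len(CANNED) else EOS


source = ["ĠWhat", "Ġcountry", "Ġdo", "Ġyou", "Ġwant", "Ġto", "Ġtravel", "Ġto", "?"]
generated = greedy_translate(source, toy_next_token)
print(detokenize(generated))  # ¿A qué país quieres viajar?
```

Always taking the single most likely token is the simplest decoding strategy. Other strategies exist, but the stopping rule on [EOS] stays the same.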
A Note About Teacher Forcing
In order to make training efficient, the transformer outputs a next-token prediction for every token in the Spanish sequence in a single pass. This works because during training we already know the full answer. We feed the decoder the ground-truth previous tokens (the Spanish sequence shifted right by one position, exactly the “shifted right” input from the diagram), and we compare the prediction at each position against the token that actually comes next to compute the loss. The model never uses its own predictions during training. Every prediction is conditioned on the correct history, as if a teacher were handing it the right previous tokens at each step. That is what “teacher forcing” means.
Concretely, let’s say we are translating the English sentence “What country do you want to travel to?” to the Spanish sentence “¿A qué país quieres viajar?”. We feed the decoder the real Spanish tokens (shifted right) and have it predict the next token at every position at once, scoring each prediction against the true next token.
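Here is a small sketch of how the decoder input and the targets line up for this sentence. It uses simplified word-level tokens instead of real BPE tokens, and the `cross_entropy` helper with its toy probabilities is an illustrative assumption, not the loss code from the implementation.

```python
import math

BOS, EOS = "[BOS]", "[EOS]"

# Simplified word-level tokens for "¿A qué país quieres viajar?"
spanish = ["¿A", "qué", "país", "quieres", "viajar", "?"]

decoder_input = [BOS] + spanish  # "shifted right": what the decoder is fed
targets = spanish + [EOS]        # what it should predict at each position

for seen, target in zip(decoder_input, targets):
    print(f"{seen} -> {target}")


def cross_entropy(probs_per_position: list[dict[str, float]], targets: list[str]) -> float:
    """Average negative log-probability the model gave to each true next token."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)


# Toy model output: probability 0.5 on the correct token at every position.
toy_probs = [{t: 0.5} for t in targets]
loss = cross_entropy(toy_probs, targets)
print(round(loss, 4))  # 0.6931, i.e. ln(2)
```

All seven predictions are scored in one pass, and the causal mask in the decoder is what keeps position i from peeking at the answer for position i.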
During inference, though, we don’t have ground truth data. We are focused on pure generation, so the decoder uses its own predictions, and we just look at the next token prediction for the last token in the sequence.