Estimated duration of this module: 2 - 2.5 hours
Objective: Understand how Transformer blocks are assembled into complete architectures, and why models like BERT and GPT, though sharing the same foundation, are used for radically different tasks.
The paper "Attention Is All You Need" (2017) did not introduce just an attention mechanism. It presented a complete architecture originally designed for machine translation tasks.
This architecture has two main components: an encoder and a decoder.
🔹 Useful analogy:
Imagine a simultaneous interpreter at a conference.
- The encoder is like their ear and brain: it listens to and fully understands the speaker's sentence.
- The decoder is like their mouth: it generates the translation word by word, based on what was understood, and even self-corrects if a mistake is made!
The encoder is a stack of identical layers (six in the original Transformer). Each layer has two main sub-layers:
Multi-Head Self-Attention:
Each word attends to all other words in the same sentence. This allows each word to be "redefined" in the context of the entire sentence.
Feed-Forward Neural Network (FFN):
A simple (but powerful) neural network applied independently to each word. It serves to nonlinearly transform each word's representation.
Additionally, each sub-layer is wrapped in a residual connection followed by layer normalization (a code sketch of a full encoder layer follows below).
🔹 Final output of the Encoder:
A contextualized representation of each word in the input sentence. This representation captures not only the word's meaning but also its relationship to all others.
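To make this concrete, here is a minimal sketch of one encoder layer in PyTorch. This is illustrative code, not an official reference implementation; the sizes (d_model=512, n_heads=8, d_ff=2048) follow the original paper, and the class and variable names are my own.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + FFN, each with residual + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention (queries, keys, values all come from x)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection + layer normalization
        # Sub-layer 2: position-wise feed-forward network
        x = self.norm2(x + self.ffn(x))   # residual connection + layer normalization
        return x

# A batch of 2 "sentences" of 10 token vectors each
tokens = torch.randn(2, 10, 512)
print(EncoderLayer()(tokens).shape)  # torch.Size([2, 10, 512]): one contextualized vector per token
```

Stacking several of these layers (six in the original paper) gives the complete encoder.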
The decoder is also a stack of identical layers, but with three sub-layers instead of two:
Masked Multi-Head Self-Attention:
Here's the key difference. The decoder also applies self-attention, but with a mask that prevents each word from "looking into the future."
When generating word 3, it can only attend to words 1 and 2. It cannot cheat by peeking at word 4!
This is essential for text generation because, in the real world, when you write or speak you don't know what word comes next. (A small code sketch of this mask follows the decoder description below.)
Multi-Head Cross-Attention:
Here, the decoder "looks at" the encoder's output. Queries come from the decoder; Keys and Values come from the encoder.
This allows each word the decoder is generating to "ask" the encoder: "Which part of the original sentence is relevant to what I want to say now?"
Feed-Forward Neural Network (FFN):
Same as in the encoder.
It also uses residual connections and layer normalization.
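Here is a small PyTorch sketch, again purely illustrative, of the decoder's two attention steps: the causal mask that blocks "future" positions, and cross-attention, where Queries come from the decoder while Keys and Values come from the encoder's output.

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len, src_len = 512, 8, 5, 7
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

decoder_inputs = torch.randn(1, tgt_len, d_model)   # the words generated so far
encoder_outputs = torch.randn(1, src_len, d_model)  # the encoder's contextualized representations

# Causal mask: True above the diagonal means "this position may NOT be attended to"
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# 1) Masked self-attention: each position sees only itself and earlier positions
x, _ = self_attn(decoder_inputs, decoder_inputs, decoder_inputs, attn_mask=causal_mask)

# 2) Cross-attention: Queries from the decoder, Keys/Values from the encoder
x, weights = cross_attn(x, encoder_outputs, encoder_outputs)
print(weights.shape)  # torch.Size([1, 5, 7]): each target position attends over the 7 source positions
```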
Imagine you're cooking a new dish (generating text).
The encoder is like a food critic who has already tasted all the ingredients (input words) and gives you a detailed report: "The garlic is fine, but lacks acidity; the tomato is sweet, so pair it with something sour."
The decoder is like you, the chef, adding ingredients one by one (word by word).
And so, step by step, you generate a coherent and delicious dish!
Over time, researchers realized they didn't always need both parts.
BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, caused a revolution.
🔹 Key innovation:
Bidirectional training. Unlike GPT (or a standard left-to-right RNN), which only looks at preceding words, BERT sees the full context on both sides of each word simultaneously.
🔹 Training task:
"Masked Language Modeling": randomly masks words in a sentence and asks the model to predict them using left and right context.
Example:
"The [MASK] jumped over the fence." → The model learns that "cat," "dog," and "rabbit" are good predictions.
🔹 Result:
Extremely rich language representations, ideal for comprehension tasks.
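As a quick preview of the Hugging Face tooling we will use in the next module, a fill-mask pipeline lets you watch masked language modeling in action. This assumes the transformers library is installed; bert-base-uncased is just one of many possible checkpoints.

```python
from transformers import pipeline

# Load a masked-language-model pipeline backed by a BERT checkpoint
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely fillers for the [MASK] token
for pred in fill_mask("The [MASK] jumped over the fence."):
    print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
# Words like "cat" or "dog" typically appear among the top predictions.
```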
GPT (Generative Pre-trained Transformer), released by OpenAI, takes the opposite path.
🔹 Key innovation:
Autoregressive training. Predicts the next word in a sequence, using only prior context.
Example:
"The cat jumped over the..." → predicts "fence," "table," "bed," etc.
🔹 Training task:
"Language Modeling": simply predict the next word, repeatedly, across billions of texts.
🔹 Result:
Incredibly fluent models for text generation, maintaining long-term coherence and following instructions.
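The GPT side has an equally short preview, again assuming the transformers library is installed; gpt2 is used here simply as a small, freely available example checkpoint, and the generated continuation will vary.

```python
from transformers import pipeline

# Load an autoregressive (decoder-only) text-generation pipeline
generator = pipeline("text-generation", model="gpt2")

out = generator("The cat jumped over the", max_new_tokens=10)
print(out[0]["generated_text"])  # e.g. "The cat jumped over the fence and ran ..."
```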
BERT (Encoder-only):
Input: [The] [cat] [jumped] [over] [the] [fence]
Processing: ALL words are processed together.
Attention: Each word can see ALL others (bidirectional).
Output: Contextualized vector for EACH word → ideal for classification or extraction.
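A tiny sketch of what one contextualized vector per word looks like in practice, assuming transformers and PyTorch are installed (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat jumped over the fence", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per token (including the special [CLS] and [SEP] tokens)
print(outputs.last_hidden_state.shape)  # torch.Size([1, number_of_tokens, 768])
```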
GPT (Decoder-only):
Generation: Starts with <start>, then generates one word at a time.
Step 1: <start> → generates "The"
Step 2: <start> + "The" → generates "cat"
Step 3: <start> + "The" + "cat" → generates "jumped"
...
Attention: At each step, can only see previous words (causal/masked).
Output: A generated sequence → ideal for creating new text.
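That step-by-step loop can be written out explicitly. The sketch below uses greedy decoding for clarity; real applications would normally call model.generate() with smarter strategies. It assumes transformers and PyTorch are installed, with gpt2 as an example checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The cat jumped over the", return_tensors="pt").input_ids
with torch.no_grad():
    for step in range(5):
        logits = model(ids).logits        # scores for the next token at every position
        next_id = logits[0, -1].argmax()  # greedy: take the single most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        print(f"Step {step + 1}: {tokenizer.decode(ids[0])}")
```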
Why not always use the full encoder-decoder, then? Because it's not always necessary… and it's more expensive!
It's like choosing tools: you pick only the one the job actually needs.
Think of three different NLP tasks. For each, decide whether you'd use a BERT-style (encoder-only), GPT-style (decoder-only), or T5-style (encoder-decoder) model. Justify your choice.
Example:
Task: "Extract the person's name mentioned in a news article."
Choice: BERT, because it's an extraction/comprehension task, not generation.
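For reference, this kind of extraction is usually handled by a token-classification (NER) head on top of an encoder. Here is a hedged sketch using the Hugging Face pipeline; the checkpoint dslim/bert-base-NER is just one publicly available example.

```python
from transformers import pipeline

# Encoder-only model with a token-classification head for named-entity recognition
ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Angela Merkel visited Paris on Monday."):
    if entity["entity_group"] == "PER":  # keep only person names
        print(entity["word"], round(entity["score"], 3))
```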
Original Transformer (Translation):
[Input: "Hello world"] â ENCODER â [Representations] â DECODER â [Output: "Hola mundo"]
BERT (Sentiment Classification):
[Input: "I loved the movie"] â ENCODER â [CLS Vector] â Classifier â "POSITIVE"
GPT (Text Generation):
<start> → DECODER → "Today" → DECODER → "is" → DECODER → "a" → DECODER → "great" → ... → "day."
The Transformer is not a single model, but a family of architectures.
- The encoder is the "analyst": it deeply understands the input text.
- The decoder is the "creator": it generates new text step by step, aware of the past.
- Together, they are a "perfect translator."
BERT and GPT are two sides of the same coin: one for understanding, one for creation. Their popularity is no accident: each is optimized for its purpose.
Now that we understand the architecture, it's time to get hands-on with code! In the next module, we'll learn to use real Transformer models with Hugging Face, with no need to understand every weight or neuron. We'll load a model, feed it text, and get answers… like magic (but we know it's not)!