🏗️ MODULE 4: Complete Architecture — Encoder, Decoder, BERT, GPT, and Variants

Estimated duration of this module: 2 to 2.5 hours
Objective: Understand how Transformer blocks are assembled into complete architectures, and why models like BERT and GPT—though sharing the same foundation—are used for radically different tasks.


Lesson 4.1 — The Original Transformer: A Two-Part Architecture

The paper “Attention Is All You Need” (2017) did not introduce just an attention mechanism. It presented a complete architecture originally designed for machine translation tasks.

This architecture has two main components:

  • Encoder: Processes the input sequence (e.g., a sentence in English).
  • Decoder: Generates the output sequence (e.g., the translation in Spanish), one word at a time.

🔹 Useful analogy:

Imagine a simultaneous interpreter at a conference.

  • The encoder is like their ear and brain: listens to and fully understands the speaker’s sentence.
  • The decoder is like their mouth: generates the translation word by word, based on what was understood—and even self-corrects if a mistake is made!

Lesson 4.2 — What Does the Encoder Do?

The encoder is a stack of identical layers (6 in the original Transformer; later models use more). Each layer has two main sub-layers:

  1. Multi-Head Self-Attention:
    Each word attends to all other words in the same sentence. This allows each word to be “redefined” in the context of the entire sentence.

  2. Feed-Forward Neural Network (FFN):
    A simple (but powerful) neural network applied independently to each word. It serves to nonlinearly transform each word’s representation.

Additionally, each sub-layer is wrapped in:

  • Residual Connection (“skip connection”): Adds the sub-layer’s input back to its output. This helps gradients flow better during training.
  • Layer Normalization: Normalizes activations to stabilize and accelerate training.

🔹 Final output of the Encoder:
A contextualized representation of each word in the input sentence. This representation captures not only the word’s meaning but also its relationship to all others.
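To make this structure concrete, here is a minimal sketch of one encoder layer in PyTorch. The framework and the dimensions (d_model=512, 8 heads, d_ff=2048) are illustrative assumptions; the lesson itself does not prescribe any particular library.

  import torch
  import torch.nn as nn

  class EncoderLayer(nn.Module):
      def __init__(self, d_model=512, n_heads=8, d_ff=2048):
          super().__init__()
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.ffn = nn.Sequential(
              nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
          )
          self.norm1 = nn.LayerNorm(d_model)
          self.norm2 = nn.LayerNorm(d_model)

      def forward(self, x):
          # Sub-layer 1: multi-head self-attention (queries, keys, and values
          # all come from x), wrapped in a residual connection + layer norm.
          attn_out, _ = self.attn(x, x, x)
          x = self.norm1(x + attn_out)
          # Sub-layer 2: position-wise feed-forward network, same wrapping.
          x = self.norm2(x + self.ffn(x))
          return x

  x = torch.randn(1, 6, 512)        # (batch, 6 words, d_model)
  print(EncoderLayer()(x).shape)    # torch.Size([1, 6, 512]): one contextualized vector per word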


Lesson 4.3 — What Does the Decoder Do?

The decoder is also a stack of identical layers, but with three sub-layers instead of two:

  1. Masked Multi-Head Self-Attention:
    Here’s the key difference. The decoder also applies self-attention, but with a mask that prevents each word from “looking into the future.”

    When the model is predicting word 3, it can only attend to words 1 and 2 (the words already generated). It cannot cheat by peeking at word 3 itself or anything that comes after it!

    This is essential for text generation, because in the real world—when you write or speak—you don’t know what word comes next.

  2. Multi-Head Cross-Attention:
    Here, the decoder “looks at” the encoder’s output. Queries come from the decoder; Keys and Values come from the encoder.

    This allows each word the decoder is generating to “ask” the encoder: “Which part of the original sentence is relevant to what I want to say now?”

  3. Feed-Forward Neural Network (FFN):
    Same as in the encoder.

It also uses residual connections and layer normalization.
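Here is a minimal sketch of the “no peeking at the future” mask from sub-layer 1, in PyTorch (an assumption; the idea is the same in any framework). True marks the positions a word is NOT allowed to attend to.

  import torch

  seq_len = 4
  # Everything above the diagonal is the "future", so it gets masked out.
  mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
  print(mask)
  # tensor([[False,  True,  True,  True],
  #         [False, False,  True,  True],
  #         [False, False, False,  True],
  #         [False, False, False, False]])
  # Row 3 (third word): it may attend to words 1 to 3, but word 4 is masked out.

In PyTorch, a boolean matrix like this can be passed as the attn_mask argument of nn.MultiheadAttention to turn ordinary self-attention into masked (causal) self-attention.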


Lesson 4.4 — The “Chef and Food Critic” Analogy

Imagine you’re cooking a new dish (generating text).

  • The encoder is like a food critic who has already tasted all the ingredients (input words) and gives you a detailed report: “The garlic is fine, but lacks acidity; the tomato is sweet—pair it with something sour.”

  • The decoder is like you, the chef, adding ingredients one by one (word by word).

    • At each step, you consult your own recipe so far (masked self-attention).
    • Then, you consult the critic: “What ingredient should I use now, based on what I have and your recommendations?” (cross-attention).
    • Finally, you adjust the flavor (FFN).

And so, step by step, you generate a coherent and delicious dish!


Lesson 4.5 — Evolution: From Encoder-Decoder to Encoder-only and Decoder-only

Over time, researchers realized they didn’t always need both parts.

1. Encoder-only Architecture → BERT, RoBERTa, etc.

  • Uses only the encoder.
  • Ideal for tasks where text is not generated, but understood or classified.
  • Examples: sentiment analysis, text classification, extractive question answering, named entity recognition (NER).
  • Advantage: Can see the full context (past and future) of each word.

2. Decoder-only Architecture → GPT, Llama, Mistral, etc.

  • Uses only the decoder (without cross-attention, since there’s no encoder).
  • Ideal for autoregressive text generation tasks.
  • Examples: chatbots, story generation, summarization, code generation.
  • Key feature: Causal masked attention — can only look backward.

3. Encoder-Decoder Architecture → T5, BART, etc.

  • Uses both parts.
  • Ideal for sequence-to-sequence transformation tasks: translation, summarization, paraphrasing.
  • The encoder understands the input; the decoder generates the output.
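As a rough practical guide (an assumption about tooling, since this lesson is framework-agnostic), the three families above map onto Hugging Face “Auto” classes like this; the checkpoint names are just common examples:

  from transformers import (
      AutoModelForMaskedLM,    # encoder-only, e.g. BERT / RoBERTa
      AutoModelForCausalLM,    # decoder-only, e.g. GPT-2 / Llama / Mistral
      AutoModelForSeq2SeqLM,   # encoder-decoder, e.g. T5 / BART
  )

  bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
  t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
  print(type(bert).__name__, type(gpt2).__name__, type(t5).__name__)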

Lesson 4.6 — BERT: The King of Understanding (Encoder-only)

BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, caused a revolution.

🔹 Key innovation:
Bidirectional training. Unlike left-to-right RNN language models or GPT (which only look backward), BERT sees the context on both sides of every word simultaneously.

🔹 Training task:
“Masked Language Modeling” — randomly masks words in a sentence and asks the model to predict them using left and right context.

Example:
“The [MASK] jumped over the fence.” → The model learns that “cat,” “dog,” “rabbit” are good predictions.

🔹 Result:
Extremely rich language representations, ideal for comprehension tasks.
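A quick way to see masked language modeling in action is the Hugging Face fill-mask pipeline (the library and the "bert-base-uncased" checkpoint are assumptions used for illustration; we work with this properly in the next module):

  from transformers import pipeline

  unmasker = pipeline("fill-mask", model="bert-base-uncased")
  for pred in unmasker("The [MASK] jumped over the fence."):
      # Each prediction comes with the filled-in token and a probability score.
      print(pred["token_str"], round(pred["score"], 3))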


Lesson 4.7 — GPT: The Master of Generation (Decoder-only)

GPT (Generative Pre-trained Transformer), released by OpenAI in 2018, takes the opposite path.

🔹 Key innovation:
Autoregressive training. Predicts the next word in a sequence, using only prior context.

Example:
“The cat jumped over the...” → predicts “fence,” “table,” “bed,” etc.

🔹 Training task:
“Language Modeling” — simply predict the next word, repeatedly, across billions of texts.

🔹 Result:
Incredibly fluent models for text generation, maintaining long-term coherence and following instructions.
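The same idea, seen from the generation side, with the Hugging Face text-generation pipeline (the library and the "gpt2" checkpoint are illustrative assumptions):

  from transformers import pipeline

  generator = pipeline("text-generation", model="gpt2")
  outputs = generator(
      "The cat jumped over the",
      max_new_tokens=5,
      do_sample=True,            # sample so the three continuations differ
      num_return_sequences=3,
  )
  for out in outputs:
      print(out["generated_text"])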


Lesson 4.8 — Visual Comparison (described): BERT vs GPT

BERT (Encoder-only):
Input: [The] [cat] [jumped] [over] [the] [fence]
Processing: ALL words are processed together.
Attention: Each word can see ALL others (bidirectional).
Output: Contextualized vector for EACH word → ideal for classification or extraction.

GPT (Decoder-only):
Generation: Starts with <start>, then generates one word at a time.
Step 1: <start> → generates "The"
Step 2: <start> + "The" → generates "cat"
Step 3: <start> + "The" + "cat" → generates "jumped"
...
Attention: At each step, can only see previous words (causal/masked).
Output: A generated sequence → ideal for creating new text.
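The step-by-step loop described above can also be written out explicitly. A minimal sketch using PyTorch and the "gpt2" checkpoint (both assumptions), with greedy decoding (always pick the most likely next token):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  model.eval()

  ids = tokenizer("The cat", return_tensors="pt").input_ids
  for _ in range(5):                                # generate 5 more tokens
      with torch.no_grad():
          logits = model(ids).logits                # (1, seq_len, vocab_size)
      next_id = logits[0, -1].argmax()              # most likely next token (greedy)
      ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

  print(tokenizer.decode(ids[0]))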

Lesson 4.9 — Why Not Always Use the Full Model (Encoder-Decoder)?

Because it’s not always necessary… and it’s more expensive!

  • If you only want to understand text (e.g., “Is this tweet positive or negative?”), BERT is more efficient.
  • If you only want to generate text (e.g., “Write a poem about the sea”), GPT is more direct.
  • If you want to transform one text into another (e.g., “Translate this to French” or “Summarize this article”), then you need encoder-decoder.

It’s like choosing tools:

  • Do you only need a screwdriver? Don’t buy a full toolbox.
  • Are you building a house? Then you need the full set.

✍️ Reflection Exercise 4.1

Think of three different NLP tasks. For each, decide whether you’d use a BERT-style (encoder-only), GPT-style (decoder-only), or T5-style (encoder-decoder) model. Justify your choice.

Example:
Task: “Extract the person’s name mentioned in a news article.”
Choice: BERT → because it’s an extraction/comprehension task, not generation.


📊 Conceptual Diagram 4.1 — Transformer Architectures (described)

Original Transformer (Translation):
[Input: "Hello world"] → ENCODER → [Representations] → DECODER → [Output: "Hola mundo"]

BERT (Sentiment Classification):
[Input: "I loved the movie"] → ENCODER → [CLS Vector] → Classifier → "POSITIVE"

GPT (Text Generation):
<start> → DECODER → "Today" → DECODER → "is" → DECODER → "a" → DECODER → "great" → ... → "day."
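The three flows above correspond, roughly, to three ready-made pipelines. A hedged sketch (the model names are assumptions chosen for illustration, not part of the diagram):

  from transformers import pipeline

  translate = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
  classify = pipeline("sentiment-analysis")
  generate = pipeline("text-generation", model="gpt2")

  print(translate("Hello world")[0]["translation_text"])     # e.g. "Hola mundo"
  print(classify("I loved the movie")[0]["label"])           # e.g. "POSITIVE"
  print(generate("Today is", max_new_tokens=6)[0]["generated_text"])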

🧠 Module 4 Conclusion

The Transformer is not a single model, but a family of architectures.

  • The encoder is the “analyst”: deeply understands input text.
  • The decoder is the “creator”: generates new text step by step, aware of the past.
  • Together, they are a “perfect translator.”

BERT and GPT are two sides of the same coin: one for understanding, one for creation. Their popularity is no accident—each is optimized for its purpose.

Now that we understand the architecture, it’s time to get hands-on with code! In the next module, we’ll learn to use real Transformer models with Hugging Face—no need to understand every weight or neuron. We’ll load a model, feed it text, and get answers… like magic (but we know it’s not)!


Course Info

Course: AI-course2

Language: EN

Lesson: Module4