🧲 MODULE 2: The Attention Revolution — The “Superpower” That Changed Everything


Lesson 2.1 — What Is “Attention” in Transformers?

Attention is a mechanism that allows each word in a sentence to “ask” all other words: “How much should I pay attention to you to understand myself?”

Yes, literally.

Each word generates a “Query” (Q).
Each word also has a “Key” (K) and a “Value” (V).

The magic lies in how they relate.


Lesson 2.2 — Analogy of the “Collaborative Classroom”

Imagine you’re in a classroom, and the teacher asks you to define the word “bank.”

You (“bank”) can ask all your classmates:

  • “Hey, you’re ‘river’ — do you relate to me?” → Maybe, if we’re talking geography.
  • “You’re ‘money’ — do you relate to me?” → Absolutely! If we’re in economics.
  • “You’re ‘fishing’ — do you relate to me?” → Possibly, if we’re on a riverbank.

Each classmate responds with a relevance score (based on their Key).
Then, you take responses from the most relevant ones (their Values), combine them, and form your context-aware definition.

That, in essence, is the attention mechanism.


Lesson 2.3 — The Three Magic Vectors: Query, Key, Value

Each word is converted into three vector representations:

  • Query (Q): “What am I looking for?” — The question the word asks.
  • Key (K): “What information do I offer?” — The answer the word gives to others’ questions.
  • Value (V): “What is my actual content?” — The information transmitted if someone pays attention to you.
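These three vectors are not looked up in a table — each one is produced from the word’s embedding by a learned weight matrix. A minimal NumPy sketch of that projection step (the matrices here are random placeholders standing in for trained weights, and the tiny dimensions are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 4                 # toy sizes; real models use hundreds
x = rng.standard_normal(d_model)    # embedding of one word

# In a Transformer these projections are learned during training;
# random matrices here only show the shape of the computation.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

q = x @ W_q   # Query: "What am I looking for?"
k = x @ W_k   # Key:   "What information do I offer?"
v = x @ W_v   # Value: "What is my actual content?"

print(q.shape, k.shape, v.shape)
```

The same embedding thus plays three different roles, depending on which matrix it passes through.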

🔹 Simplified Process:

  1. Take the dot product of word A’s Q with word B’s K → get a “compatibility score.”
  2. Repeat for all combinations.
  3. Apply softmax to convert scores into probabilities (attention weights).
  4. Multiply those weights by the V of each word.
  5. Sum everything → get a new representation of word A, enriched by context.
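The five steps above can be sketched in a few lines of NumPy. The Q, K, and V matrices below are random stand-ins for what a trained model would actually produce:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy Q, K, V for a 3-word sentence, vector dimension 4
rng = np.random.default_rng(42)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

# Steps 1-2: compatibility scores of word A (row 0) with every word
scores = Q[0] @ K.T
# Step 3: softmax turns the scores into attention weights that sum to 1
weights = softmax(scores)
# Steps 4-5: weight every word's Value and sum
new_repr = weights @ V   # new, context-enriched representation of word A

print(weights, new_repr.shape)
```

Note that word A also attends to itself — its own Key and Value take part in the sum like everyone else’s.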

Lesson 2.4 — Concrete Example: “bank” in Two Contexts

Take two sentences:

A) “I went to the bank to deposit my salary.”
B) “I sat on the bank of the river to read.”

When the model processes “bank” in sentence A:

  • Its Query (“What kind of bank am I?”) has high compatibility with the Keys of “deposit,” “salary,” “money.”
  • Therefore, the Values of those words strongly influence its final representation → understood as a financial institution.

In sentence B:

  • High compatibility with “sat,” “river,” “read.”
  • Understood as the edge of a river.

No rules, no dictionaries! Just contextual attention.


Lesson 2.5 — Scaled Dot-Product Attention

This is the technical name for the mechanism we just described.

Formula (for reference only — no need to memorize):

Attention(Q, K, V) = softmax( (Q · K^T) / √d_k ) · V

Where:

  • Q · K^T = dot product between queries and keys → similarity scores.
  • √d_k = scaling factor (d_k is the dimension of the key vectors; dividing by √d_k keeps the dot products from growing too large and saturating the softmax).
  • softmax = converts scores into weights (sum to 1).
  • · V = weights the values by those scores.

🔹 Important: You don’t need to understand the formula to use Transformers. But you do need to grasp the concept: each word redefines itself based on its relationship with all others.
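For readers who do want to see the formula in action, here is a minimal NumPy sketch of scaled dot-product attention over a whole sentence at once. Random matrices stand in for the learned projections of a real model:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity scores, scaled for stability
    # Row-wise softmax: each word's weights over all words sum to 1
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # 5 words, d_k = 8
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))

out, weights = attention(Q, K, V)
print(out.shape)                  # one enriched vector per word
print(weights.sum(axis=1))        # each row of weights sums to 1
```

Because the whole computation is a handful of matrix products, every word’s new representation is computed at once — this is the parallelism the module’s conclusion refers to.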


✍️ Reflection Exercise 2.1

Choose an ambiguous word (like “court,” “plant,” “flame”). Write two sentences where it has different meanings. Then, describe which other words in each sentence “should” have high attention with it, and why.


📊 Conceptual Diagram 2.1 — Attention Mechanism (described)

Word: "bank"
Query: [0.8, -0.2, 0.5] → "Seeking financial or geographical context"

Neighbor words:
- "deposit" → Key: [0.9, 0.1, -0.3] → HIGH compatibility → Value: [1.0, 0.8, ...]
- "park"    → Key: [-0.7, 0.6, 0.4] → LOW compatibility → Value: [0.2, 0.1, ...]
- "salary"  → Key: [0.85, -0.1, 0.2] → HIGH compatibility → Value: [0.9, 0.7, ...]

Attention weights (post-softmax, illustrative round numbers):
deposit: 0.7
park:    0.1
salary:  0.2

New "bank" vector = 0.7 * Value_deposit + 0.1 * Value_park + 0.2 * Value_salary
→ Result: vector biased toward "financial institution"
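The weights in the diagram (0.7 / 0.1 / 0.2) are illustrative round numbers. If you actually run softmax over the dot products of the listed Query and Keys, you get different exact values but the same qualitative picture: “deposit” and “salary” dominate while “park” is nearly ignored. A quick check in NumPy:

```python
import numpy as np

q = np.array([0.8, -0.2, 0.5])        # Query of "bank" (from the diagram)
K = np.array([[0.9,  0.1, -0.3],      # Key of "deposit"
              [-0.7, 0.6,  0.4],      # Key of "park"
              [0.85, -0.1, 0.2]])     # Key of "salary"

scores = K @ q                        # dot products: [0.55, -0.48, 0.80]
e = np.exp(scores - scores.max())     # numerically stable softmax
weights = e / e.sum()

print(np.round(weights, 3))           # "deposit" and "salary" high, "park" low
```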

🧠 Module 2 Conclusion

Attention is not magic — it’s an elegant mathematical mechanism that allows machines to understand language contextually, dynamically, and relationally. It’s the heart of the Transformer. And best of all: it’s fully parallelizable!

Now that we understand attention, let’s see how a full Transformer is built from its components: embeddings, position, and multi-head attention.




Course Info

Course: AI-course2

Language: EN

Lesson: Module2