Imagine you want to teach a computer to understand a sentence like:
“The cat jumped over the fence because it saw a bird.”
Before 2017, the most popular models for understanding language were Recurrent Neural Networks (RNNs) and their improved variants, such as LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units).
These networks worked like a conveyor belt of words: they processed one word at a time, in order, and maintained a kind of “memory” or “hidden state” that tried to remember what it had seen before.
🔹 Useful analogy:
Think of a person reading a book blindfolded, touching only one letter at a time with a finger. As they move forward, they try to mentally remember what they’ve read so far to understand the full meaning. It’s exhausting, slow, and they easily forget the beginning by the time they reach the end!
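To make the “conveyor belt” image concrete, here is a minimal sketch in plain NumPy (the weight names W_h, W_x and the toy sizes are made up for illustration) of the loop every simple RNN runs: read one word, update one hidden state, move on.

```python
import numpy as np

def simple_rnn(word_vectors, W_h, W_x, b):
    """Process a sentence one word at a time, carrying a single hidden state.

    word_vectors: list of vectors, one per word, in order.
    W_h, W_x, b: the RNN's (hypothetical) learned parameters.
    """
    hidden = np.zeros(W_h.shape[0])        # the "memory", starts empty
    for x in word_vectors:                 # strictly sequential: no skipping ahead
        hidden = np.tanh(W_h @ hidden + W_x @ x + b)
    return hidden                          # everything the model "remembers"

# Toy usage: 5 words, 8-dimensional embeddings, 16-dimensional hidden state
rng = np.random.default_rng(0)
words = [rng.normal(size=8) for _ in range(5)]
W_h, W_x, b = rng.normal(size=(16, 16)), rng.normal(size=(16, 8)), np.zeros(16)
print(simple_rnn(words, W_h, W_x, b).shape)  # (16,): one state for the whole sentence
```

Note how the final hidden state is the only thing carried forward: everything the model learned about word 1 has to survive every later update.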
Despite their popularity, RNNs had three major limitations:
❌ Short Memory:
When a sentence is very long, the RNN “forgets” the first words. For example:
“At my grandfather’s farm, where I spent all my childhood summers, there was a dog named Toby, who... [20 words later]... I miss dearly.”
By the time the model reaches “I miss dearly,” it has lost the connection to “Toby.” This is closely tied to the vanishing gradient problem. The technical details are complex, but the intuition is simple: information “dilutes” a little at every step, so the earliest words fade away. (A toy numeric sketch of this dilution follows the third limitation below.)
❌ No Parallelism:
Since it can only process one word at a time, the work cannot be spread across the sentence in parallel. On a GPU with thousands of cores, this is a massive waste. Like owning a Formula 1 car… but being forced to drive in first gear!
❌ Backward-Only Context:
A standard RNN only “looks backward.” It cannot see the next word to better understand the current one, yet in many cases the meaning of a word depends on what comes after.
Example: “I went to the bank to deposit money...” vs “I went to the river bank to fish...”
Only the rest of each sentence (“deposit money” or “to fish”) reveals which “bank” is meant, and a model that never looks ahead misses part of that clue.
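To put a rough number on the “dilution” from the first limitation, here is a toy calculation (not a real training run, just repeated scaling): if the signal flowing backward through the network shrinks by a constant factor at each step, it is practically gone after fifty words.

```python
# Toy illustration of information/gradients "diluting" over time steps.
# Assumption: each step scales the signal by a constant factor < 1 (here 0.9).
factor = 0.9
signal = 1.0
for step in range(1, 51):
    signal *= factor
    if step in (10, 30, 50):
        print(f"after {step:2d} steps: {signal:.4f}")
# after 10 steps: 0.3487
# after 30 steps: 0.0424
# after 50 steps: 0.0052
```

Real RNN gradients do not shrink by a fixed factor, but the overall effect is similar: whatever happened fifty words ago barely influences what the model does now.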
In 2017, a team of Google researchers published a paper that would forever change AI:
“Attention Is All You Need” — Vaswani et al., 2017
This paper introduced a radically new architecture: the Transformer.
Its big idea was simple, yet revolutionary:
“What if, instead of reading word by word, we read the entire sentence at once… and allow each word to ‘ask’ all others how relevant they are to understand itself?”
This is what’s called the attention mechanism.
And with that, a new era was born.
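As a small preview, here is a bare-bones sketch of that idea: a toy version of scaled dot-product attention over made-up word vectors (real Transformers add learned query, key and value projections, multiple heads, positional information and more, which we will cover later).

```python
import numpy as np

def attention(X):
    """Every word (a row of X) scores every other word, then takes a weighted mix.

    X: (n_words, d) matrix of word vectors. Returns a matrix of the same shape.
    This toy version skips the learned projections of a real Transformer.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # how relevant is word j to word i?
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                              # each word becomes a mix of all words

# Toy usage: 4 words, 8-dimensional vectors; all pairs interact in one shot
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(attention(X).shape)  # (4, 8)
```

Notice that nothing in this computation is sequential: word 1 and word N interact exactly as easily as two neighbours do.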
The Transformer solved the three major problems of RNNs:
✅ Long-Range Memory:
Since it processes all the words together, nothing fades with distance: each word can “look” at any other word in the input, however far apart they are (within the model’s context window).
✅ Parallel Processing:
Since the computation doesn’t have to march through the sentence word by word, the entire input can be processed at once, exactly the kind of workload that keeps a GPU’s thousands of cores busy.
✅ Bidirectional Context (in some models):
Each word can see both what came before and what comes after, which allows much more precise disambiguation.
🔹 Quick exercise:
Think of a long sentence where the meaning of a word at the beginning depends on a word at the end. Write it down. Then imagine how an RNN and a Transformer would each process it. Which one would find it easier? Why?
RNN (one word at a time, left to right):

[Word 1] → [Word 2] → [Word 3] → ... → [Word N]
    ↓          ↓          ↓               ↓
 State 1  → State 2  → State 3  → ... → State N

Transformer (the whole sentence at once):

[Word 1]   [Word 2]   [Word 3]   ...   [Word N]
    ↕          ↕          ↕              ↕
ATTENTION: every word communicates with every other word
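Read as code, the difference between the two diagrams is a Python-level loop versus a couple of matrix multiplications. A toy sketch (random vectors stand in for real word embeddings, and the weights are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 6, 8
X = rng.normal(size=(n_words, d))        # one row per word

# RNN-style: an unavoidable loop; step t cannot start before step t-1 finishes.
W = rng.normal(size=(d, d)) * 0.1
state = np.zeros(d)
for x in X:
    state = np.tanh(W @ state + x)

# Transformer-style: the whole sentence in a few matrix operations, no loop over words.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ X                        # every word already "sees" every other word

print(state.shape, out.shape)            # (8,) vs (6, 8)
```

The loop is what keeps an RNN stuck in first gear; the matrix form is what lets a GPU chew through the whole sentence at once.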
RNNs were the heroes of their time, but had structural limitations. The Transformer wasn’t just an incremental improvement — it was a paradigm shift. And all thanks to a seemingly simple idea: attention.
In the next module, we’ll dissect that idea: What is attention? How does it work? Why is it so powerful?