Part 6: The Eyes of the Model – Self-Attention
Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
How Transformers Learned to Look Everywhere, All at Once
Introduction: The Heart of the Revolution
In Parts 1 through 5, we journeyed from Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs) to word embeddings, encoders, and decoders, building the foundation of the Transformer architecture. At the core of this leap lies self-attention, a mechanism that allows models to process entire sequences simultaneously, capturing relationships across words regardless of distance. Introduced in the 2017 paper "Attention is All You Need" by Vaswani et al., self-attention powers models like BERT and GPT. This article, the sixth in an 8-part series, explores how self-attention works, why it matters, and its wide-ranging impact.
What Is Self-Attention?
Self-attention is the process that enables a model to assess the importance of every word in a sentence relative to every other word. Rather than processing text sequentially like RNNs, Transformers use self-attention to "see" the whole sentence at once, determining which words influence each other most. This global perspective is key to understanding complex language patterns.
Example
Consider the sentence: "The cat sat on the mat because it was tired."
To interpret "it," the model needs to connect it to "cat" (the subject), not "mat" (the location), even though five other words separate them. Self-attention makes this possible by focusing on relevant tokens, no matter their position.
How Self-Attention Works – In 4 Steps
Given an input sequence of tokens ( x_1, x_2, \ldots, x_n ), where each token is an embedding vector, self-attention proceeds as follows:
Query (Q), Key (K), and Value (V)
- Each input vector is linearly transformed into three vectors:
  - Query (Q): What this word is looking for.
  - Key (K): What this word contains.
  - Value (V): What this word contributes.
- Mathematically:
  ( Q = X W^Q ), ( K = X W^K ), ( V = X W^V )
  where ( W^Q, W^K, W^V ) are learned weight matrices and ( X ) is the matrix of input embeddings.
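In code, these projections are simply three learned linear layers applied to the same input. The sketch below is a minimal illustration; the hidden size, sequence length, and variable names are assumptions chosen for readability, not values prescribed by this article:

import torch
import torch.nn as nn

d_model = 512                           # embedding size per token (illustrative)
seq_len = 6                             # tokens in the example sentence
X = torch.randn(1, seq_len, d_model)    # a batch of token embeddings

# One learned weight matrix per role: W^Q, W^K, W^V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(X)   # what each token is looking for
K = W_k(X)   # what each token contains
V = W_v(X)   # what each token contributes

print(Q.shape, K.shape, V.shape)   # each: torch.Size([1, 6, 512])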
Compute Attention Scores
- The relevance between two words is measured by the dot product of their Query and Key:
  ( \text{score}_{ij} = q_i \cdot k_j )
- This score indicates how much attention ( x_i ) should pay to ( x_j ).
Apply Softmax
- Normalize the scores (scaled by ( \sqrt{d_k} ) to keep them in a stable range) to obtain attention weights:
  ( A = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) )
- Each row of ( A ) represents the attention distribution, showing how much each token focuses on the others.
Weighted Sum of Values
- The output for each token is a weighted sum of all Value vectors:
  ( \text{output}_i = \sum_j A_{ij} v_j )
- This combines information from all tokens, weighted by their relevance.
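Putting steps 2 through 4 together, here is a tiny self-contained walkthrough using random vectors; the number of tokens and the dimensions are arbitrary and purely illustrative:

import torch
import torch.nn.functional as F
import math

d_k = 4                              # tiny key dimension, for readability
Q = torch.randn(3, d_k)              # 3 tokens, one Query/Key/Value each
K = torch.randn(3, d_k)
V = torch.randn(3, d_k)

scores = Q @ K.T / math.sqrt(d_k)    # step 2: scaled dot-product scores
A = F.softmax(scores, dim=-1)        # step 3: each row sums to 1
output = A @ V                       # step 4: weighted sum of Values

print(A.sum(dim=-1))                 # tensor([1., 1., 1.])
print(output.shape)                  # torch.Size([3, 4])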
In Simple Terms
For each word:
- It looks at every other word in the sequence.
- It asks: "How much should I care about you?"
- It combines their information based on the answer, creating a context-rich representation.
Why It Matters
Self-attention transforms language modeling by:
- Capturing Long-Range Dependencies: Linking "it" to "cat" even when many words separate them.
- Focusing on Relevance: Prioritizing meaningful words regardless of position.
- Processing in Parallel: Enabling fast training on GPUs, unlike sequential RNNs.
- Learning Relationships: Discovering syntactic (e.g., subject-verb) and semantic (e.g., synonyms) patterns naturally.
Multi-Head Attention
Transformers enhance self-attention with multi-head attention, running multiple attention mechanisms in parallel (typically 8 or 12 heads).
- Each head learns different relationships:
  - One might focus on syntax (e.g., "cat" to "sat").
  - Another might detect meaning (e.g., "cat" to "kitten").
- The outputs are concatenated and passed through a linear layer, enriching the model’s understanding.
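As a rough sketch of the bookkeeping involved (split the projections into heads, attend within each head, concatenate, then apply the final linear layer); the class name and sizes are illustrative assumptions, not a reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # linear layer applied after concatenation

    def forward(self, x):
        B, T, _ = x.shape
        # Project, then reshape so each head gets its own d_k-sized slice
        def split(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)   # [B, heads, T, d_k]
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        out = weights @ V                                      # [B, heads, T, d_k]

        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concatenate the heads
        return self.W_o(out)

x = torch.randn(1, 6, 512)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([1, 6, 512])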
Causal Masking (Decoder)
In decoders, self-attention is modified with a causal mask to prevent tokens from attending to future positions, enforcing left-to-right generation.
- The mask is a triangular matrix, setting attention scores for future tokens to negative infinity (effectively zero after softmax).
- This ensures that when generating "cat," the model can only attend to "The" and earlier tokens, mirroring how text is produced left to right.
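A quick way to see the mask in action (the sequence length is arbitrary, chosen only to keep the printout small):

import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend, 0 = future position
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

scores = torch.zeros(seq_len, seq_len)            # dummy scores, all equal
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)
print(weights[1])   # tensor([0.5000, 0.5000, 0.0000, 0.0000]) -- token 2 attends only to tokens 1 and 2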
Self-Attention vs. Traditional Attention
Feature | Traditional (Seq2Seq) | Self-Attention (Transformer) |
Source of Attention | Encoder to Decoder | Within same sequence |
Sequential | Yes | No (fully parallel) |
Long-Distance Dependency | Weak | Strong |
Scalability | Limited | Highly scalable |
This shift from sequential to parallel processing marks a fundamental advance.
Visual Summary
For the input ( ["The", "cat", "sat", "on", "the", "mat"] ), consider the token "sat":
- Its Query is compared with the Keys of all tokens.
- Attention Vector: ( [low, high, self, low, low, low] ), with "cat" receiving high attention.
- Output: A weighted sum of all Values, emphasizing "cat"’s context.
This process repeats for each token, building a holistic representation.
Code Sketch (PyTorch-style)
Here’s a Python-style implementation of scaled dot-product attention:
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Compute attention scores, scaled by the key dimension
    d_k = K.size(-1)  # Dimension of keys
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask if provided (e.g., causal mask)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Apply softmax to get attention weights
    weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(weights, V)
    return output

# Example usage
batch_size, n_heads, seq_len, d_k = 1, 8, 6, 64
Q = torch.randn(batch_size, n_heads, seq_len, d_k)
K = torch.randn(batch_size, n_heads, seq_len, d_k)
V = torch.randn(batch_size, n_heads, seq_len, d_k)
mask = torch.tril(torch.ones(seq_len, seq_len))  # Causal mask (lower triangular)

output = scaled_dot_product_attention(Q, K, V, mask)
print("Output shape:", output.shape)  # Expected: torch.Size([1, 8, 6, 64])
- Explanation: This code computes attention scores via dot product, scales by ( \sqrt{d_k} ), applies a causal mask (lower triangular), and produces a weighted output. The multi-head dimension allows parallel attention, matching the Transformer’s design.
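As a side note, recent PyTorch releases (2.0 and later) ship a built-in torch.nn.functional.scaled_dot_product_attention that performs the same computation with optimized kernels; the snippet below assumes such a version is installed:

import torch
import torch.nn.functional as F

Q = torch.randn(1, 8, 6, 64)
K = torch.randn(1, 8, 6, 64)
V = torch.randn(1, 8, 6, 64)

# is_causal=True applies the lower-triangular mask internally
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 6, 64])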
Why It Changed Everything
Self-attention is the engine driving:
- Translation: Aligning source and target languages.
- Text Summarization: Identifying key points.
- Text Generation: Powering GPT and ChatGPT.
- Image Captioning: Linking visual and textual data.
- Protein Folding: As in AlphaFold.
- Code Generation: Writing programs.
- Music Creation: Composing melodies.
It’s not just a technique; it’s the core abstraction behind the broad, general-purpose capabilities of Transformer-based models.
Up Next: Part 7 – Parallel Processing: Transformers at Scale
With self-attention as the foundation, we’ll explore in Part 7 why Transformers process data in parallel, unlike RNNs, and how this scalability birthed models like GPT.