Part 6: The Eyes of the Model – Self-Attention
Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
How Transformers Learned to Look Everywhere, All at Once
Introduction: The Heart of the Revolution
In Parts 1 through 5, we journeyed from Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs) to word embeddings, encoders, and decoders, building the foundation of the Transformer architecture. At the core of this leap lies self-attention, a mechanism that allows models to process entire sequences simultaneously, capturing relationships across words regardless of distance. Introduced in the 2017 paper "Attention is All You Need" by Vaswani et al., self-attention powers models like BERT and GPT. This article, the sixth in an 8-part series, explores how self-attention works, why it matters, and its wide-ranging impact.
What Is Self-Attention?
Self-attention is the process that enables a model to assess the importance of every word in a sentence relative to every other word. Rather than processing text sequentially like RNNs, Transformers use self-attention to "see" the whole sentence at once, determining which words influence each other most. This global perspective is key to understanding complex language patterns.
Example
Consider the sentence: "The cat sat on the mat because it was tired."
To interpret "it," the model needs to connect it to "cat" (the subject), not "mat" (the location), even though five other words separate them. Self-attention makes this possible by focusing on relevant tokens, no matter their position.
How Self-Attention Works – In 4 Steps
Given an input sequence of tokens ( x_1, x_2, \ldots, x_n ), where each token is an embedding vector, self-attention proceeds as follows:
Query (Q), Key (K), and Value (V)
- Each input vector is linearly transformed into three vectors:
  - Query (Q): What this word is looking for.
  - Key (K): What this word contains.
  - Value (V): What this word contributes.
- Mathematically:
  ( Q = X W^Q ), ( K = X W^K ), ( V = X W^V )
  where ( W^Q, W^K, W^V ) are learned weight matrices and ( X ) is the matrix of input embeddings.
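In code, these projections are simply three learned linear layers applied to the same input. The sketch below is a minimal illustration; the hidden size, sequence length, and variable names are assumptions chosen for readability, not values prescribed by this article:

import torch
import torch.nn as nn

d_model = 512                           # embedding size per token (illustrative)
seq_len = 6                             # tokens in the example sentence
X = torch.randn(1, seq_len, d_model)    # a batch of token embeddings

# One learned weight matrix per role: W^Q, W^K, W^V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(X)   # what each token is looking for
K = W_k(X)   # what each token contains
V = W_v(X)   # what each token contributes

print(Q.shape, K.shape, V.shape)   # each: torch.Size([1, 6, 512])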
Compute Attention Scores
- The relevance between two words is measured by the dot product of their Query and Key:
  ( \text{score}_{ij} = q_i \cdot k_j )
- This score indicates how much attention ( x_i ) should pay to ( x_j ).
Apply Softmax
- Normalize the scores (scaled by ( \sqrt{d_k} ) to keep them in a stable range) to obtain attention weights:
  ( A = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) )
- Each row of ( A ) represents the attention distribution, showing how much each token focuses on the others.
Weighted Sum of Values
- The output for each token is a weighted sum of all Value vectors:
  ( \text{output}_i = \sum_j A_{ij} v_j )
- This combines information from all tokens, weighted by their relevance.
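Putting steps 2 through 4 together, here is a tiny self-contained walkthrough using random vectors; the number of tokens and the dimensions are arbitrary and purely illustrative:

import torch
import torch.nn.functional as F
import math

d_k = 4                              # tiny key dimension, for readability
Q = torch.randn(3, d_k)              # 3 tokens, one Query/Key/Value each
K = torch.randn(3, d_k)
V = torch.randn(3, d_k)

scores = Q @ K.T / math.sqrt(d_k)    # step 2: scaled dot-product scores
A = F.softmax(scores, dim=-1)        # step 3: each row sums to 1
output = A @ V                       # step 4: weighted sum of Values

print(A.sum(dim=-1))                 # tensor([1., 1., 1.])
print(output.shape)                  # torch.Size([3, 4])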
In Simple Terms
For each word:
- It looks at every other word in the sequence.
- It asks: "How much should I care about you?"
- It combines their information based on the answer, creating a context-rich representation.
Why It Matters
Self-attention transforms language modeling by:
- Capturing Long-Range Dependencies: Linking "it" to "cat" even when many words separate them.
- Focusing on Relevance: Prioritizing meaningful words regardless of position.
- Processing in Parallel: Enabling fast training on GPUs, unlike sequential RNNs.
- Learning Relationships: Discovering syntactic (e.g., subject-verb) and semantic (e.g., synonyms) patterns naturally.
Multi-Head Attention
Transformers enhance self-attention with multi-head attention, running multiple attention mechanisms in parallel (typically 8 or 12 heads).
- Each head learns different relationships:
  - One might focus on syntax (e.g., "cat" to "sat").
  - Another might detect meaning (e.g., "cat" to "kitten").
- The outputs are concatenated and passed through a linear layer, enriching the model’s understanding.
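As a rough sketch of the bookkeeping involved (split the projections into heads, attend within each head, concatenate, then apply the final linear layer); the class name and sizes are illustrative assumptions, not a reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # linear layer applied after concatenation

    def forward(self, x):
        B, T, _ = x.shape
        # Project, then reshape so each head gets its own d_k-sized slice
        def split(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)   # [B, heads, T, d_k]
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        out = weights @ V                                      # [B, heads, T, d_k]

        out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concatenate the heads
        return self.W_o(out)

x = torch.randn(1, 6, 512)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([1, 6, 512])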
Causal Masking (Decoder)
In decoders, self-attention is modified with a causal mask to prevent tokens from attending to future positions, enforcing left-to-right generation.
- The mask is a triangular matrix, setting attention scores for future tokens to negative infinity (effectively zero after softmax).
- This ensures that when generating "cat," the model can only attend to "The" and earlier tokens, mirroring how text is produced left to right.
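A quick way to see the mask in action (the sequence length is arbitrary, chosen only to keep the printout small):

import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend, 0 = future position
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

scores = torch.zeros(seq_len, seq_len)            # dummy scores, all equal
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)
print(weights[1])   # tensor([0.5000, 0.5000, 0.0000, 0.0000]) -- token 2 attends only to tokens 1 and 2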
Self-Attention vs. Traditional Attention
Feature | Traditional (Seq2Seq) | Self-Attention (Transformer) |
Source of Attention | Encoder to Decoder | Within same sequence |
Sequential | Yes | No (fully parallel) |
Long-Distance Dependency | Weak | Strong |
Scalability | Limited | Highly scalable |
This shift from sequential to parallel processing marks a fundamental advance.
Visual Summary
For the input ( ["The", "cat", "sat", "on", "the", "mat"] ), consider the token "sat":
- Its Query is compared with the Keys of all tokens.
- Attention Vector: ( [low, high, self, low, low, low] ), with "cat" receiving high attention.
- Output: A weighted sum of all Values, emphasizing "cat"’s context.
This process repeats for each token, building a holistic representation.
Code Sketch (PyTorch-style)
Here’s a Python-style implementation of scaled dot-product attention:
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Compute attention scores, scaled by the key dimension
    d_k = K.size(-1)  # Dimension of keys
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask if provided (e.g., causal mask)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Apply softmax to get attention weights
    weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(weights, V)
    return output

# Example usage
batch_size, n_heads, seq_len, d_k = 1, 8, 6, 64
Q = torch.randn(batch_size, n_heads, seq_len, d_k)
K = torch.randn(batch_size, n_heads, seq_len, d_k)
V = torch.randn(batch_size, n_heads, seq_len, d_k)
mask = torch.tril(torch.ones(seq_len, seq_len))  # Causal mask (lower triangular)

output = scaled_dot_product_attention(Q, K, V, mask)
print("Output shape:", output.shape)  # Expected: torch.Size([1, 8, 6, 64])
- Explanation: This code computes attention scores via dot product, scales by ( \sqrt{d_k} ), applies a causal mask (lower triangular), and produces a weighted output. The multi-head dimension allows parallel attention, matching the Transformer’s design.
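As a side note, recent PyTorch releases (2.0 and later) ship a built-in torch.nn.functional.scaled_dot_product_attention that performs the same computation with optimized kernels; the snippet below assumes such a version is installed:

import torch
import torch.nn.functional as F

Q = torch.randn(1, 8, 6, 64)
K = torch.randn(1, 8, 6, 64)
V = torch.randn(1, 8, 6, 64)

# is_causal=True applies the lower-triangular mask internally
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 6, 64])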
Why It Changed Everything
Self-attention is the engine driving:
- Translation: Aligning source and target languages.
- Text Summarization: Identifying key points.
- Text Generation: Powering GPT and ChatGPT.
- Image Captioning: Linking visual and textual data.
- Protein Folding: As in AlphaFold.
- Code Generation: Writing programs.
- Music Creation: Composing melodies.
It’s not just a technique; it’s the core abstraction behind the broad, general-purpose capabilities of Transformer-based models.
Up Next: Part 7 – Parallel Processing: Transformers at Scale
With self-attention as the foundation, we’ll explore in Part 7 why Transformers process data in parallel, unlike RNNs, and how this scalability birthed models like GPT.