Part 4: The Comprehender – Transformer Encoders
Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
How Machines Learned to Understand Entire Sentences at Once
Introduction: A New Way to See Language
In Parts 1 through 3, we traced the evolution from Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs) to word embeddings, which gave machines a way to process and represent language. Yet, these models struggled with speed and long-range context. The Transformer encoder, introduced in the landmark 2017 paper "Attention is All You Need" by Vaswani et al., changed that by processing entire sentences simultaneously. This article, the fourth in an 8-part series, explores how encoders work, their role in models like BERT, and why they mark a leap forward in understanding language.
What is an Encoder in the Transformer?
The Transformer architecture, a cornerstone of modern AI, splits its work between an encoder and a decoder. The encoder’s job is to process an input sequence—say, a sentence like "The cat sat on the mat"—and create context-aware representations for each word (or token). Unlike RNNs, which plod through text word by word, encoders use self-attention to analyze the entire sequence at once, capturing how every word relates to every other. This parallel processing, debuted in the Vaswani et al. paper, powers models like BERT, RoBERTa, and DistilBERT, revolutionizing natural language processing (NLP).
Anatomy of the Transformer Encoder
The encoder is a stack of identical blocks, typically 6 to 12 layers deep, each built around two core components:
Transformer Encoder Block: Parallel processing with self-attention and feed-forward layers

Multi-Head Self-Attention
- Allows each token to "pay attention" to every other token in the input.
- Captures relationships like subject-verb agreements or noun-adjective pairs across the sequence.
- "Multi-head" means it performs this attention multiple ways, enriching the representation.
Feed-Forward Neural Network (FFN)
- A small, position-wise multilayer perceptron (MLP) applied independently to each token.
- Adds nonlinear transformations and depth to the model.
Each component is enhanced with:
- Residual Connections: Skip connections that add the input to the output, aiding gradient flow.
- Layer Normalization: Stabilizes training by normalizing the outputs.
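To make these components concrete, here is a minimal from-scratch sketch of multi-head scaled dot-product attention; the function name and dimensions are illustrative, and it skips the learned query/key/value projections, masking, and dropout that real implementations such as PyTorch's nn.MultiheadAttention include.

import torch
import torch.nn.functional as F

def multi_head_self_attention(x, n_heads):
    # x: [batch, seq_len, d_model]; split d_model into n_heads independent "views"
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q = k = v = x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)  # [batch, heads, seq, d_head]
    # Scaled dot-product: how strongly each token attends to every other token
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # [batch, heads, seq, seq]
    weights = F.softmax(scores, dim=-1)                # each row of weights sums to 1
    context = weights @ v                              # weighted mix of value vectors
    # Concatenate the heads back into one d_model-sized vector per token
    return context.transpose(1, 2).reshape(batch, seq_len, d_model)

tokens = torch.randn(1, 6, 512)   # e.g. the six tokens of "The cat sat on the mat"
out = multi_head_self_attention(tokens, n_heads=8)
print(out.shape)                  # torch.Size([1, 6, 512])

Each head works on a different slice of the representation, which is what lets the block capture several kinds of relationships at once.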
How Encoding Works – Step-by-Step
Let’s walk through how the encoder processes "The cat sat on the mat":
Token + Position Embedding
- Each word is converted into an embedding vector (from word embeddings, Part 3).
- Positional encodings are added to indicate word order (since attention is order-agnostic).
- Mathematically: z_i = Embedding(word_i) + PositionalEncoding(i), where z_i is the position-aware input vector for the i-th token (sketched in code after this walkthrough).
Self-Attention
- Each token compares itself with all others using dot-product attention.
- For example:
- "sat" might focus heavily on "cat" (the subject).
- "mat" might prioritize "on" (its location).
- Attention weights reflect these relationships, creating a contextual view.
Feed-Forward + Layer Norm
- Each token’s vector passes through the FFN, then a residual connection and layer normalization.
- The output is a new vector per token, now infused with the sentence’s full meaning.
These steps repeat across layers, refining the representations into rich, semantic encodings.
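The first step above, turning words into position-aware vectors, can be sketched in a few lines. This uses the fixed sinusoidal encoding from the original paper; the toy vocabulary and token IDs are made up purely for illustration.

import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    # Fixed sin/cos encodings so the model knows where each token sits in the sequence
    pos = torch.arange(seq_len).unsqueeze(1).float()   # [seq_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

token_ids = torch.tensor([[1, 2, 3, 4, 1, 5]])              # "The cat sat on the mat" (toy IDs)
embed = nn.Embedding(num_embeddings=10, embedding_dim=512)  # toy vocabulary of 10 words
z = embed(token_ids) + sinusoidal_positional_encoding(6, 512)   # z_i = Embedding(word_i) + PE(i)
print(z.shape)                                              # torch.Size([1, 6, 512])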
What Encoders Are Good At
Encoders excel at understanding inputs rather than generating outputs, making them ideal for:
- BERT-style Models: Used in masked language modeling (e.g., predicting hidden words).
- Text Classification: Determining sentiment or topic.
- Named Entity Recognition: Identifying names, places, etc.
- Semantic Search: Finding contextually similar texts.
- Question Answering: Extracting answers from passages.
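As a concrete example of these understanding tasks, the sketch below pulls contextual encodings out of a pretrained BERT encoder with the Hugging Face transformers library (assuming it is installed); mean-pooling the token vectors is one simple, common way to get a sentence embedding for semantic search.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")   # encoder-only stack

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_encodings = outputs.last_hidden_state     # [1, num_tokens, 768], one context-aware vector per token
sentence_vector = token_encodings.mean(dim=1)   # crude sentence embedding for similarity search
print(token_encodings.shape, sentence_vector.shape)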
Encoder Block – Code Metaphor
Here’s a minimal, runnable PyTorch implementation of an encoder block (simplified: real implementations also add dropout and attention masking):

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Multi-head self-attention; batch_first=True expects [batch, seq_len, d_model]
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with residual connection and layer normalization
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward with residual connection and layer normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x

# Example usage
d_model = 512   # Model dimension
n_heads = 8     # Number of attention heads
encoder_block = TransformerEncoderBlock(d_model, n_heads)
input_tensor = torch.randn(1, 10, d_model)   # Batch size 1, sequence length 10
output = encoder_block(input_tensor)
print("Output shape:", output.shape)   # Expected: torch.Size([1, 10, 512])
- Explanation: This code defines an encoder block with multi-head attention and a feed-forward network, wrapped in residual connections and layer normalization. The forward method processes an input tensor, outputting a contextualized representation of the same shape.
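For comparison, PyTorch ships the same block as a built-in module, so a full multi-layer encoder can be assembled in a few lines; the hyperparameters below simply mirror the example above and the original paper’s 6-layer setup.

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # stack of 6 identical encoder blocks

tokens = torch.randn(1, 10, 512)   # [batch, seq_len, d_model]
contextual = encoder(tokens)       # same shape: [1, 10, 512], now context-aware
print(contextual.shape)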
Encoders vs. RNNs – The Leap
Feature | RNN | Transformer Encoder |
Processing | Sequential | Parallel |
Memory | Hidden state | Self-attention context |
Long-Range Context | Weak | Strong |
Speed | Slow | Fast on GPUs |
Scalability | Poor | Excellent |
With encoders, the model doesn’t "read" text word by word—it "sees" the whole sentence at once, a paradigm shift enabled by self-attention.
Encoders in Action: BERT
BERT (Bidirectional Encoder Representations from Transformers), released in 2018 by Google, showcases the encoder’s power:
- Design: Uses only the encoder stack, processing text bidirectionally.
- Training: Pre-trained with masked language modeling (e.g., predicting "[MASK]" in "The cat [MASK] on the mat").
- Fine-Tuning: Adapted for tasks like sentiment analysis or question answering.
- Impact: Powered Google’s improved search understanding and set NLP benchmarks in 2019.
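The masked-language-modeling objective is easy to see in action with the transformers pipeline API (again assuming the library and model weights are available); the encoder fills in the blank using context from both directions.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The cat [MASK] on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))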



Up Next: Part 5 – The Decoder: Speaking the Language
While encoders master understanding, decoders handle generation. In Part 5, we’ll explore how decoders power models like GPT and how the encoder-decoder pair drives tasks like translation.