Rajiv Gopinath

Last updated:   July 06, 2025

Part 4: The Comprehender – Transformer Encoders

Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
How Machines Learned to Understand Entire Sentences at Once

Introduction: A New Way to See Language

In Parts 1 through 3, we traced the evolution from Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs) to word embeddings, which gave machines a way to process and represent language. Yet, these models struggled with speed and long-range context. The Transformer encoder, introduced in the landmark 2017 paper "Attention is All You Need" by Vaswani et al., changed that by processing entire sentences simultaneously. This article, the fourth in an 8-part series, explores how encoders work, their role in models like BERT, and why they mark a leap forward in understanding language.

 

What is an Encoder in the Transformer?

The Transformer architecture, a cornerstone of modern AI, splits its work between an encoder and a decoder. The encoder’s job is to process an input sequence—say, a sentence like "The cat sat on the mat"—and create context-aware representations for each word (or token). Unlike RNNs, which plod through text word by word, encoders use self-attention to analyze the entire sequence at once, capturing how every word relates to every other. This parallel processing, debuted in the Vaswani et al. paper, powers models like BERT, RoBERTa, and DistilBERT, revolutionizing natural language processing (NLP).

 

Anatomy of the Transformer Encoder

The encoder is built from a stack of identical blocks, typically 6 to 12 layers deep, each containing two core components:

 

Transformer Encoder Block: Parallel processing with self-attention and feed-forward layers

 

  1. Multi-Head Self-Attention 

    • Allows each token to "pay attention" to every other token in the input.
    • Captures relationships like subject-verb agreement or noun-adjective pairings across the sequence.
    • "Multi-head" means this attention is computed several times in parallel with different learned projections, and the results are combined, enriching the representation (a sketch of the underlying dot-product attention follows these lists).
  2. Feed-Forward Neural Network (FFN) 

    • A small, position-wise multilayer perceptron (MLP) applied independently to each token.
    • Adds nonlinear transformations and depth to the model.

Each component is enhanced with:

  • Residual Connections: Skip connections that add the input to the output, aiding gradient flow.
  • Layer Normalization: Stabilizes training by normalizing the outputs.
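
To make the attention step concrete, here is a minimal sketch of the scaled dot-product attention each head computes, softmax(QK^T / sqrt(d_k)) V as in Vaswani et al.; the shapes and names are illustrative, and a real encoder also applies learned query, key, and value projections per head.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_k) -- one head, one sequence, no learned projections
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # how strongly each token matches every other token
    weights = F.softmax(scores, dim=-1)            # attention weights, each row sums to 1
    return weights @ v, weights                    # weighted mix of value vectors, plus the weights

# Toy example: 6 tokens ("The cat sat on the mat"), 4-dimensional vectors
x = torch.randn(6, 4)
context, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape)  # torch.Size([6, 6]): every token attends to all 6 tokens
print(context.shape)  # torch.Size([6, 4])

Multi-head attention simply runs several such computations in parallel on projected copies of the inputs and concatenates the results.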

 

How Encoding Works – Step-by-Step

Let’s walk through how the encoder processes "The cat sat on the mat":

  1. Token + Position Embedding 

    • Each word is converted into an embedding vector (from word embeddings, Part 3).
    • Positional encodings are added to indicate word order (since attention is order-agnostic).
    • Mathematically: z_i = x_i + p_i, where x_i is the word embedding, p_i the positional encoding, and z_i the position-aware input to the encoder (a sketch of these sinusoidal encodings follows this walkthrough).
  2. Self-Attention

    • Each token compares itself with all others using dot-product attention.
    • For example: 
      • "sat" might focus heavily on "cat" (the subject).
      • "mat" might prioritize "on" (its location).
    • Attention weights reflect these relationships, creating a contextual view.
  3. Feed-Forward + Layer Norm 

    • Each token’s vector passes through the FFN, then a residual connection and layer normalization.
    • The output is a new vector per token, now infused with the sentence’s full meaning.

These steps repeat across layers, refining the representations into rich, semantic encodings.
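
Because self-attention by itself ignores word order, step 1 adds positional information before the first block. Here is a minimal sketch of the sinusoidal positional encodings used in the original paper; the helper name and sequence length are illustrative.

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimensions
    angles = pos / (10000 ** (i / d_model))                       # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# z_i = x_i + p_i: each token embedding gets its position added in
seq_len, d_model = 6, 512
token_embeddings = torch.randn(seq_len, d_model)  # stand-in for learned embeddings (Part 3)
z = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(z.shape)  # torch.Size([6, 512])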

 

What Encoders Are Good At

Encoders excel at understanding inputs rather than generating outputs, making them ideal for the tasks below (see the sketch after this list):

  • BERT-style Models: Used in masked language modeling (e.g., predicting hidden words).
  • Text Classification: Determining sentiment or topic.
  • Named Entity Recognition: Identifying names, places, etc.
  • Semantic Search: Finding contextually similar texts.
  • Question Answering: Extracting answers from passages.
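
As a quick illustration of two of these tasks, the sketch below uses the Hugging Face transformers library, whose pipeline API wraps pretrained encoder-based models; it assumes the library is installed and its default English models can be downloaded.

from transformers import pipeline

# Text classification: an encoder reads the whole sentence at once and outputs a label
classifier = pipeline("sentiment-analysis")
print(classifier("The cat sat calmly on the mat."))

# Extractive question answering: the encoder scores spans of the passage as candidate answers
qa = pipeline("question-answering")
print(qa(question="Where did the cat sit?", context="The cat sat on the mat."))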

 

Encoder Block – Code Metaphor

Here’s a simplified encoder block written with PyTorch:

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Multi-head self-attention; batch_first=True expects (batch, seq_len, d_model) inputs
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network applied independently to each token
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with residual connection and layer normalization
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward with residual connection and layer normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x

# Example usage
d_model = 512  # Model dimension
n_heads = 8    # Number of attention heads
encoder_block = TransformerEncoderBlock(d_model, n_heads)
input_tensor = torch.randn(1, 10, d_model)  # Batch size 1, sequence length 10
output = encoder_block(input_tensor)
print("Output shape:", output.shape)  # Expected: torch.Size([1, 10, 512])
  • Explanation: This code defines an encoder block with multi-head attention and a feed-forward network, wrapped in residual connections and layer normalization. The forward method processes an input tensor, outputting a contextualized representation of the same shape.
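
Since each block maps a (batch, sequence, d_model) tensor to another of the same shape, building the "stack" described earlier is just a matter of chaining blocks. A short sketch, reusing the class and imports above (the layer count of 6 is illustrative):

# Chain several blocks to form a deeper encoder, e.g. a 6-layer stack
encoder_stack = nn.Sequential(*[TransformerEncoderBlock(d_model=512, n_heads=8) for _ in range(6)])
tokens = torch.randn(1, 10, 512)   # batch of 1 sentence, 10 tokens
encodings = encoder_stack(tokens)  # each layer refines the previous layer's representations
print(encodings.shape)             # torch.Size([1, 10, 512])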

 

Encoders vs. RNNs – The Leap

Feature             | RNN            | Transformer Encoder
Processing          | Sequential     | Parallel
Memory              | Hidden state   | Self-attention context
Long-Range Context  | Weak           | Strong
Speed               | Slow           | Fast on GPUs
Scalability         | Poor           | Excellent

With encoders, the model doesn’t "read" text word by word—it "sees" the whole sentence at once, a paradigm shift enabled by self-attention.

 

Encoders in Action: BERT

BERT (Bidirectional Encoder Representations from Transformers), released in 2018 by Google, showcases the encoder’s power:

  • Design: Uses only the encoder stack, processing text bidirectionally.
  • Training: Pre-trained with masked language modeling (e.g., predicting "[MASK]" in "The cat [MASK] on the mat"); see the sketch after this list.
  • Fine-Tuning: Adapted for tasks like sentiment analysis or question answering.
  • Impact: Powered Google Search's improved query understanding, rolled out in 2019, and set new state-of-the-art results across NLP benchmarks.
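
To see masked language modeling in action, here is a minimal sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint (assuming the library and model weights are available):

from transformers import pipeline

# BERT's encoder uses context on both sides of the mask to rank candidate words
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("The cat [MASK] on the mat."):
    print(candidate["token_str"], round(candidate["score"], 3))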

Up Next: Part 5 – The Decoder: Speaking the Language

While encoders master understanding, decoders handle generation. In Part 5, we’ll explore how decoders power models like GPT and how the encoder-decoder pair drives tasks like translation.