Part 3: Giving Words Meaning – Word Embeddings
Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
Turning Words into Vectors Before It Was Cool
Introduction: The Challenge of Language for Machines
In Parts 1 and 2, we saw how Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks learned to process sequences, but they faced a fundamental hurdle: computers don’t understand words like "cat" or "democracy"; they need numbers. Early attempts to convert language into data were crude, but word embeddings revolutionized this process by giving words numerical meaning based on context. This article, the third in an 8-part series, explores how word embeddings work, why they matter, and their limitations, setting the stage for the transformer revolution.
The Problem: Words Are Not Numbers
Neural networks thrive on numerical input, yet human language is symbolic. Feeding "cat" or "love" directly into a model is impossible without translation. Early solutions included:
- One-Hot Encoding: Each word gets a vector with a 1 in its position and 0s elsewhere (e.g., a 10,000-dimensional vector for a 10,000-word vocabulary).
- Bag-of-Words (BoW): Counts word frequency in a text, ignoring order and meaning (e.g., {"cat": 2, "dog": 1}).
These methods were sparse (mostly zeros), inefficient (huge vectors), and lacked semantic understanding—treating "cat" and "kitten" as unrelated. We needed a smarter way to represent words.
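To make the sparsity concrete, here is a minimal sketch of both representations in plain Python; the three-word vocabulary and example sentence are made up for illustration, and a real system would have tens of thousands of vocabulary entries:

from collections import Counter

# Illustrative vocabulary; real vocabularies have tens of thousands of entries
vocab = ["cat", "dog", "sat"]

def one_hot(word):
    # A vector as long as the vocabulary, with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def bag_of_words(tokens):
    # Word counts only; order and context are discarded
    counts = Counter(t for t in tokens if t in vocab)
    return [counts[w] for w in vocab]

print(one_hot("cat"))                       # [1, 0, 0]
print(bag_of_words("cat sat cat".split()))  # [2, 0, 1]

Notice that neither representation says anything about how "cat" relates to "dog"; every word is equally distant from every other word.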
The Breakthrough: Word Embeddings
Word embeddings solved this by mapping words to dense, low-dimensional vectors (e.g., 300 numbers) that capture meaning based on how words are used together. The guiding principle, coined by linguist J.R. Firth, was: "You shall know a word by the company it keeps." If "king" often appears with "royal" or "crown," its vector should reflect that proximity.
This shift turned language into a mathematical space where meaning emerges from context, enabling machines to generalize and reason about words.
How Word Embeddings Work
Two landmark methods paved the way:
Word2Vec (2013, Google)
- Approach: Trains a shallow neural network with two strategies:
- CBOW (Continuous Bag-of-Words): Predicts a word from its surrounding words (e.g., predicting "cat" from the context "the fluffy ___ sat").
- Skip-Gram: Predicts surrounding words from a target word (e.g., predicting "the" and "fluffy" from "cat").
- Result: Vectors where:
- Similar-meaning words (e.g., "king" and "queen") are close in space.
- Algebraic relationships hold: king − man + woman ≈ queen
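The gensim library exposes both training modes through its Word2Vec class, with the sg flag switching between CBOW and Skip-Gram. The toy corpus below is made up and far too small to learn meaningful vectors; it is only a sketch of the API:

from gensim.models import Word2Vec

# Toy corpus: a real model would be trained on millions of sentences
sentences = [
    ["the", "fluffy", "cat", "sat", "on", "the", "mat"],
    ["the", "royal", "king", "wore", "a", "crown"],
    ["the", "queen", "wore", "a", "royal", "crown"],
]

# sg=0 selects CBOW (predict the word from its context)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 selects Skip-Gram (predict the context from the word)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv["king"][:5])  # first 5 dimensions of the learned vector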
GloVe (2014, Stanford)
- Approach: Uses global word co-occurrence statistics from a large corpus, rather than prediction.
- Result: Embeddings that capture both local context (nearby words) and global patterns (overall corpus trends).
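GloVe's actual objective is a weighted least-squares fit over log co-occurrence counts, but the statistics it starts from are easy to picture. As a rough sketch (with an illustrative corpus and window size), a co-occurrence matrix can be counted like this:

from collections import defaultdict

corpus = [["the", "king", "wore", "a", "crown"],
          ["the", "queen", "wore", "a", "crown"]]
window = 2

# cooccur[w1][w2] counts how often w2 appears within `window` words of w1
cooccur = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccur[word][sentence[j]] += 1

print(dict(cooccur["king"]))  # {'the': 1, 'wore': 1, 'a': 1}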
A later improvement, fastText, added subword information to handle out-of-vocabulary (OOV) words and morphological variations (e.g., "cats" and "cat").
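gensim also ships a FastText implementation. The sketch below uses another made-up toy corpus; the point is that a vector can still be composed from character n-grams even for a word that never appeared in training:

from gensim.models import FastText

sentences = [["the", "cat", "sat"], ["the", "cats", "sat"]]

# min_n/max_n control the character n-gram lengths used as subwords
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "catting" was never seen, but its character n-grams overlap with "cat"/"cats",
# so FastText can still build a vector for it instead of failing on OOV
print(model.wv["catting"][:5])
print(model.wv.similarity("cat", "cats"))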
Visualizing Word Embeddings
Plotting word vectors using techniques like t-SNE or PCA reveals fascinating patterns:
- Synonyms Cluster: "happy" and "joyful" group together.
- Word Groups: Animals, professions, or emotions form distinct regions.
- Encoded Relationships: Directions in the vector space reflect gender (e.g., "king" to "queen"), plurality, or tense.
This visualization underscores how embeddings encode meaning beyond raw counts.
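As a minimal sketch of such a plot (assuming scikit-learn and matplotlib are installed and that model is a trained gensim Word2Vec model, e.g., from the earlier sketch), a 2D PCA projection could look like this; the word list is illustrative:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Illustrative word list; every word must exist in the model's vocabulary
words = ["king", "queen", "crown", "cat", "mat"]
vectors = [model.wv[w] for w in words]

# Project the high-dimensional vectors down to 2D for plotting
points = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()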
Why Embeddings Matter
Word embeddings transformed NLP by:
- Turning Language into Math: Providing a numerical foundation for neural networks.
- Enabling Generalization: Allowing models to understand similar words (e.g., "dog" and "puppy") without explicit training.
- Powering Applications: Driving semantic search, sentiment analysis, and classification.
- Serving as Input: Forming the foundation for LSTMs, GRUs, and eventually transformers.
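To illustrate the last point, pretrained vectors are typically copied into a model's embedding layer and then fed to a downstream network. A minimal PyTorch sketch, assuming torch is installed and kv is a gensim KeyedVectors object (for example, skipgram_model.wv from the earlier sketch):

import torch
import torch.nn as nn

# kv is assumed to be a loaded gensim KeyedVectors object
weights = torch.FloatTensor(kv.vectors)           # shape: (vocab_size, embedding_dim)
embedding = nn.Embedding.from_pretrained(weights)

# An LSTM that consumes the embedded sequence
lstm = nn.LSTM(input_size=weights.shape[1], hidden_size=128, batch_first=True)

# A batch of one sentence, already converted to vocabulary indices
token_ids = torch.tensor([[kv.key_to_index["king"], kv.key_to_index["wore"]]])
embedded = embedding(token_ids)                   # shape: (1, 2, embedding_dim)
output, (h, c) = lstm(embedded)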
Limitations of Early Embeddings
Despite their power, early word embeddings had a critical flaw: they were static.
- A word like "bank" had the same vector whether it meant a riverbank or a financial institution, ignoring sentence context.
- This lack of adaptability limited their use in complex, context-dependent tasks, setting the stage for transformers to introduce contextual embeddings, where a word’s meaning shifts based on its usage.
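The staticness is easy to see in code: a Word2Vec-style lookup returns the identical vector for "bank" no matter which sentence it appears in. A minimal sketch, assuming model is a trained gensim Word2Vec model as in the earlier examples:

import numpy as np

def embed(word, sentence):
    # A static embedding lookup: the sentence argument has no effect at all
    return model.wv[word]

vec_river = embed("bank", "she sat by the river bank".split())
vec_money = embed("bank", "he deposited cash at the bank".split())
print(np.array_equal(vec_river, vec_money))  # True: identical vector in both contexts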
Code Example: Load Pretrained Word2Vec
Here’s how to use a pretrained Word2Vec model with the gensim library to explore word similarities:
from gensim.models import KeyedVectors
# Load pretrained Word2Vec model (Google News vectors)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Find words most similar to "king"
similar_words = model.most_similar('king', topn=5)
print("Words similar to 'king':", similar_words)
# Demonstrate vector arithmetic
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("king - man + woman ≈", result)
Sample Output
Assuming the Google News Word2Vec model is loaded (a roughly 3.5GB binary file of 300-dimensional vectors for about 3 million words and phrases, trained on roughly 100 billion words of news text), the output might look like:
Words similar to 'king': [('queen', 0.651095), ('prince', 0.604983), ('monarch', 0.528169), ('throne', 0.521253), ('kingdom', 0.498492)]
king - man + woman ≈ [('queen', 0.765432)]
- Note: Exact values depend on the model version and training data. Unlike the untrained toy example in Part 2, this uses a real-world pretrained model. You’ll need to download the GoogleNews-vectors-negative300.bin file (originally released by Google) or fetch the same vectors through gensim’s downloader API, as shown below.
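A sketch of the downloader route (the model name is the one registered in gensim-data; the first call downloads well over a gigabyte and caches it locally):

import gensim.downloader as api

# Downloads and caches the Google News vectors on first use
model = api.load("word2vec-google-news-300")
print(model.most_similar("king", topn=5))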
Recap Table
| Model | Method | Strength |
|---|---|---|
| Word2Vec | Predictive (context/word) | Fast, intuitive |
| GloVe | Count-based (co-occurrence) | Global + local semantics |
| fastText | Subword info (handles OOV) | Robust to typos, morphological variants |
Up Next: Part 4 – The Encoder: Understanding Context in Full View
With word embeddings as the input, we now turn to how transformers use the encoder to understand context across an entire sequence. Stay tuned!