
Rajiv Gopinath

Part 3: Giving Words Meaning – Word Embeddings (Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution)

Last updated: July 06, 2025

Statistics and Data Science Hub | Tags: Transformers, Encoders, Decoders, LSTM, RNN, Neural Networks, Deep Learning, NLP Models, Sequence Modeling, AI Architecture

Part 3: Giving Words Meaning – Word Embeddings

Series: From Sequences to Sentience: Building Blocks of the Transformer Revolution
Turning Words into Vectors Before It Was Cool

Introduction: The Challenge of Language for Machines

In Parts 1 and 2, we saw how Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks learned to process sequences, but they faced a fundamental hurdle: computers don’t understand words like "cat" or "democracy"—they need numbers. Early attempts to convert language into data were crude, but the invention of word embeddings revolutionized this process by giving words numerical meaning based on context. This article, the third in an 8-part series, explores how word embeddings work, why they matter, and their limitations, setting the stage for the transformer revolution.

 

The Problem: Words Are Not Numbers

Neural networks thrive on numerical input, yet human language is symbolic. Feeding "cat" or "love" directly into a model is impossible without translation. Early solutions included:

  • One-Hot Encoding: Each word gets a vector with a 1 in its position and 0s elsewhere (e.g., a 10,000-dimensional vector for a 10,000-word vocabulary).
  • Bag-of-Words (BoW): Counts word frequency in a text, ignoring order and meaning (e.g., {"cat": 2, "dog": 1}).

These methods were sparse (mostly zeros), inefficient (huge vectors), and lacked semantic understanding—treating "cat" and "kitten" as unrelated. We needed a smarter way to represent words.
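To make that sparseness concrete, here is a minimal sketch in Python (the tiny vocabulary is invented purely for illustration) of one-hot and bag-of-words representations:

import numpy as np

# Toy vocabulary; a real system would have tens of thousands of entries
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A single 1 in the word's position, 0 everywhere else: sparse, no meaning
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def bag_of_words(sentence):
    # Frequency counts only; word order is thrown away
    counts = np.zeros(len(vocab))
    for word in sentence.split():
        counts[word_to_index[word]] += 1
    return counts

print(one_hot("cat"))                          # [0. 1. 0. 0. 0. 0.]
print(bag_of_words("the cat sat on the mat"))  # [2. 1. 1. 1. 1. 0.]

# The dot product of any two different one-hot vectors is 0, so "cat" and
# "dog" look exactly as unrelated as "cat" and "the".
print(float(np.dot(one_hot("cat"), one_hot("dog"))))  # 0.0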

 

The Breakthrough: Word Embeddings

Word embeddings solved this by mapping words to dense, low-dimensional vectors (e.g., 300 numbers) that capture meaning based on how words are used together. The guiding principle, coined by linguist J.R. Firth, was: "You shall know a word by the company it keeps." If "king" often appears with "royal" or "crown," its vector should reflect that proximity.

This shift turned language into a mathematical space where meaning emerges from context, enabling machines to generalize and reason about words.
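To make "close in a mathematical space" concrete, here is a minimal sketch with hand-picked 3-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions); cosine similarity is the usual measure of that closeness:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1 = similar direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, invented only to illustrate the geometry
king  = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: words used in similar contexts
print(cosine_similarity(king, apple))  # low: words used in different contexts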

 

How Word Embeddings Work

Two landmark methods paved the way:

  1. Word2Vec (2013, Google) 

    • Approach: Trains a shallow neural network using one of two strategies (a minimal training sketch follows this list): 
      • CBOW (Continuous Bag-of-Words): Predicts a word from its surrounding words (e.g., predicting "cat" from "the fluffy").
      • Skip-Gram: Predicts surrounding words from a target word (e.g., predicting "the" and "fluffy" from "cat").
    • Result: Vectors where: 
      • Similar-meaning words (e.g., "king" and "queen") are close in space.
      • Algebraic relationships hold: king − man + woman ≈ queen
  2. GloVe (2014, Stanford) 

    • Approach: Uses global word co-occurrence statistics from a large corpus, rather than prediction.
    • Result: Embeddings that capture both local context (nearby words) and global patterns (overall corpus trends).
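Here is a minimal training sketch, assuming the gensim library is installed; the toy corpus is invented and far too small to learn meaningful vectors, so it only shows the shape of the API:

from gensim.models import Word2Vec

# Tiny invented corpus: each sentence is a list of tokens.
# Real models are trained on millions or billions of sentences.
sentences = [
    ["the", "king", "wore", "a", "golden", "crown"],
    ["the", "queen", "wore", "a", "golden", "crown"],
    ["the", "royal", "family", "lives", "in", "the", "palace"],
    ["the", "fluffy", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects Skip-Gram; sg=0 would select CBOW instead
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)         # (50,): a dense vector, not a sparse one
print(model.wv.most_similar("king"))  # nearest neighbours (noisy on a toy corpus)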

A later improvement, fastText, added subword information to handle out-of-vocabulary (OOV) words and morphological variations (e.g., "cats" and "cat").
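A similar sketch with gensim's FastText class (again on an invented toy corpus) shows the benefit of subword information: a word never seen during training still gets a vector built from its character n-grams, where a plain Word2Vec lookup would fail:

from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cats", "play", "with", "the", "dog"],
]

# Character n-gram vectors are learned alongside whole-word vectors
model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=30)

# "kitten" never appears in the corpus, yet fastText still composes a vector
# for it from its character n-grams; Word2Vec would raise a KeyError here.
print(model.wv["kitten"].shape)  # (50,)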

 

Visualizing Word Embeddings

Plotting word vectors using techniques like t-SNE or PCA reveals fascinating patterns:

  • Synonyms Cluster: "happy" and "joyful" group together.
  • Word Groups: Animals, professions, or emotions form distinct regions.
  • Encoded Relationships: Directions in the vector space reflect gender (e.g., "king" to "queen"), plurality, or tense.

This visualization underscores how embeddings encode meaning beyond raw counts.
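Here is a minimal plotting sketch, assuming scikit-learn and matplotlib are installed and that a pretrained gensim KeyedVectors object has already been loaded as kv (for example the Google News vectors used in the code example further below):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# `kv` is assumed to be a loaded gensim KeyedVectors object (see the
# pretrained Word2Vec example later in this article).
words = ["king", "queen", "man", "woman", "cat", "dog", "happy", "joyful"]
vectors = [kv[w] for w in words]

# Project the 300-dimensional vectors down to 2 dimensions for plotting
points = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(words, points):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2D with PCA")
plt.show()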

 

 

 

[Figure: Word2Vec with negative sampling]

Why Embeddings Matter

Word embeddings transformed NLP by:

  • Turning Language into Math: Providing a numerical foundation for neural networks.
  • Enabling Generalization: Allowing models to understand similar words (e.g., "dog" and "puppy") without explicit training.
  • Powering Applications: Driving semantic search, sentiment analysis, and classification.
  • Serving as Input: Forming the foundation for LSTMs, GRUs, and eventually transformers.

 

Limitations of Early Embeddings

Despite their power, early word embeddings had a critical flaw: they were static.

  • A word like "bank" had the same vector whether it meant a riverbank or a financial institution, ignoring sentence context.
  • This lack of adaptability limited their use in complex, context-dependent tasks, setting the stage for transformers to introduce contextual embeddings, where a word’s meaning shifts based on its usage.
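A short sketch makes the problem visible, assuming the same pretrained KeyedVectors model that is loaded as model in the code example below: the embedding is a plain lookup table, so the surrounding sentence has no effect on the vector returned for "bank":

import numpy as np

# `model` is assumed to be the pretrained KeyedVectors object loaded in the
# code example below. A static embedding is just a lookup table, so the
# sentence around the word plays no role in the vector we get back.
sentence_1 = "she sat on the river bank"
sentence_2 = "she deposited cash at the bank"

vector_1 = model["bank"]  # sentence_1's context cannot influence this lookup
vector_2 = model["bank"]  # neither can sentence_2's

print(np.array_equal(vector_1, vector_2))  # True: the context is ignored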

 

Code Example: Load Pretrained Word2Vec

Here’s how to use a pretrained Word2Vec model with the gensim library to explore word similarities:

from gensim.models import KeyedVectors

# Load pretrained Word2Vec model (Google News vectors)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Find words most similar to "king"
similar_words = model.most_similar('king', topn=5)
print("Words similar to 'king':", similar_words)

# Demonstrate vector arithmetic
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("king - man + woman ≈", result)

Sample Output

Assuming the Google News Word2Vec model is loaded (a 3GB file with 300-dimensional vectors trained on 100 billion words), the output might look like:

Words similar to 'king': [('queen', 0.651095), ('prince', 0.604983), ('monarch', 0.528169), ('throne', 0.521253), ('kingdom', 0.498492)]

king - man + woman ≈ [('queen', 0.765432)]

  • Note: Exact values depend on the model version and data. Unlike the untrained toy example in Part 2, this example uses a real-world pretrained model. You’ll need the GoogleNews-vectors-negative300.bin file, which you can download from a public mirror or load directly via gensim’s downloader API (gensim.downloader.load('word2vec-google-news-300')).

 

Recap Table

Model      Method                         Strength
Word2Vec   Predictive (context/word)      Fast, intuitive
GloVe      Count-based (co-occurrence)    Global + local semantics
fastText   Subword info (handles OOV)     Robust to typos, morphologies

 

Up Next: Part 4 – The Encoder: Understanding Context in Full View

With word embeddings as the input, we now turn to how transformers use the encoder to understand context across an entire sequence. Stay tuned!