Word Embeddings
Word embeddings are dense vector representations of words where semantically similar words have similar vectors. They're the foundation of modern NLP.
The Problem with One-Hot Encoding
Traditional approach: represent each word as a sparse vector.
"cat" → [1, 0, 0, 0, 0, ...] (10,000+ dimensions)
"dog" → [0, 1, 0, 0, 0, ...]
"car" → [0, 0, 1, 0, 0, ...]
Problems:
- High dimensional: Vocabulary size = dimensions
- Sparse: Mostly zeros
- No semantic meaning: all one-hot vectors are orthogonal, so "cat" and "dog" are exactly as different as "cat" and "car"
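The orthogonality problem is easy to verify directly. A minimal NumPy sketch with a toy 5-word vocabulary (real vocabularies would be 10,000+ dimensions):

```python
import numpy as np

# Toy one-hot vectors for a 5-word vocabulary
cat = np.array([1, 0, 0, 0, 0])
dog = np.array([0, 1, 0, 0, 0])
car = np.array([0, 0, 1, 0, 0])

# Every pair of distinct one-hot vectors has dot product 0,
# so the encoding carries no notion of similarity at all
print(cat @ dog)  # 0
print(cat @ car)  # 0
```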
The Solution: Dense Embeddings
Map words to low-dimensional dense vectors:
"cat" → [0.2, -0.4, 0.7, 0.1, ...] (300 dimensions)
"dog" → [0.3, -0.3, 0.6, 0.2, ...] (similar to cat!)
"car" → [-0.5, 0.8, -0.2, 0.4, ...] (different)
Similar words → similar vectors.
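Using the toy 4-dimensional vectors from the example above (real embeddings are typically ~300-dimensional), cosine similarity makes "similar vectors" concrete:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, -1 = opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors mirroring the example above
cat = np.array([0.2, -0.4, 0.7, 0.1])
dog = np.array([0.3, -0.3, 0.6, 0.2])
car = np.array([-0.5, 0.8, -0.2, 0.4])

print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # low (negative here)
```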
Word2Vec
The breakthrough model from Google (2013).
Skip-gram
Predict context words from center word:
"The quick brown [fox] jumps over"
Given "fox", predict: "quick", "brown", "jumps", "over"
The training loss pulls each center word's vector toward the vectors of the words that appear in its context.
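Extracting the (center, context) training pairs from a sentence can be sketched in a few lines (window size 2, matching the example above):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over".split()
# Pairs where "fox" is the center word:
fox_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```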
CBOW (Continuous Bag of Words)
Predict center word from context:
Given: "quick", "brown", "jumps", "over"
Predict: "fox"
CBOW trains faster than skip-gram, but skip-gram usually produces better vectors, especially for rare words.
Training Tricks
Negative Sampling: Instead of computing a softmax over the entire vocabulary, sample a handful of negative examples:
- Positive: (fox, brown) → should be similar
- Negative: (fox, random_word) → should be dissimilar
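The negative-sampling objective maximizes log σ(v_center · u_positive) plus Σ log σ(−v_center · u_negative) over the sampled negatives. A minimal sketch of that loss for a single pair, with random toy vectors standing in for learned embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, positive, negatives):
    """Negative-sampling loss for one (center, context) pair.

    Low when center is similar to the true context word and
    dissimilar to each sampled negative word.
    """
    loss = -np.log(sigmoid(center @ positive))
    for neg in negatives:
        loss -= np.log(sigmoid(-center @ neg))
    return loss

# Toy stand-ins: in real training these vectors are the model parameters
rng = np.random.default_rng(0)
fox, brown = rng.normal(size=50), rng.normal(size=50)
negatives = [rng.normal(size=50) for _ in range(5)]
print(neg_sampling_loss(fox, brown, negatives))
```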
Subword Information: FastText extends Word2Vec with character n-grams:
- "where" = "<wh", "whe", "her", "ere", "re>"
- Handles rare words and misspellings
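Extracting the character n-grams with boundary markers is straightforward (trigrams shown; FastText actually uses a range of n-gram lengths):

```python
def char_ngrams(word, n=3):
    """Character n-grams with < > boundary markers, FastText-style."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's FastText vector is then the sum of the vectors of its n-grams, which is why unseen or misspelled words still get a sensible representation.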
GloVe (Global Vectors)
Stanford's alternative approach (2014).
Key insight: Word co-occurrence statistics contain semantic info.
Objective: wᵢ · w̃ⱼ + bᵢ + b̃ⱼ ≈ log(Xᵢⱼ), where Xᵢⱼ is the co-occurrence count and bᵢ, b̃ⱼ are per-word biases, with a weighting function that down-weights rare pairs.
Combines:
- Global statistics (like traditional methods)
- Local context (like neural methods)
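GloVe minimizes a weighted least-squares loss, J = Σ f(Xᵢⱼ)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², where f down-weights rare pairs and caps at 1. A sketch of that objective with tiny random parameters (x_max=100, α=0.75 are the values from the GloVe paper):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrences; caps at 1 for frequent ones
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 8  # tiny vocabulary and embedding size for illustration
X = rng.integers(0, 5, size=(V, V)).astype(float)  # toy co-occurrence counts
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_t, b, b_t, X))  # scalar, >= 0
```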
Famous Properties
Analogies
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
Vector arithmetic captures relationships!
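The analogy trick is just vector arithmetic plus a nearest-neighbor search (excluding the query words themselves). A toy sketch with hand-made 2-d vectors encoding crude "royalty" and "gender" directions, not real learned embeddings:

```python
import numpy as np

def nearest(vec, vocab, exclude=()):
    """Return the word whose vector has highest cosine similarity to vec."""
    best, best_sim = None, -np.inf
    for word, v in vocab.items():
        if word in exclude:
            continue
        sim = vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Dimension 0 ~ "royalty", dimension 1 ~ "gender" (hand-made for illustration)
vocab = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}
result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, vocab, exclude={"king", "man", "woman"}))  # queen
```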
Clustering
Similar words cluster together:
- Animals: dog, cat, horse...
- Countries: France, Germany, Italy...
Similarity
cosine_similarity("happy", "joyful") > cosine_similarity("happy", "sad")
Limitations of Static Embeddings
Polysemy Problem
"bank" has one vector, but multiple meanings:
- River bank
- Financial bank
Solution: Contextual embeddings (BERT, etc.)
Out-of-Vocabulary
No embedding for words not in training vocabulary.
Solution: Subword models (FastText, BPE)
Bias
Embeddings learn societal biases from training data:
- "doctor" closer to "man" than "woman"
- Racial and gender stereotypes encoded
Using Embeddings
Pretrained Embeddings
Download and use directly:
- Word2Vec (Google News, 3M words, 300d)
- GloVe (Wikipedia, various sizes)
- FastText (157 languages)
Fine-tuning
Start with pretrained, update during training:
- Good when you have domain-specific data
- Risk: Catastrophic forgetting of general knowledge
From Scratch
Train embeddings as part of model:
- Learn task-specific representations
- Needs lots of data
Embeddings in Deep Learning
```python
# PyTorch embedding layer
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 300
embedding = nn.Embedding(vocab_size, embedding_dim)

# Input: word indices [batch, seq_len]
word_indices = torch.tensor([[1, 5, 42]])

# Output: embeddings [batch, seq_len, embedding_dim]
vectors = embedding(word_indices)  # shape: [1, 3, 300]
```
An embedding layer is just a lookup table: selecting row i of the embedding matrix is mathematically equivalent to multiplying a one-hot vector by that matrix.
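That equivalence is easy to check directly (NumPy used here to keep the sketch dependency-free):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
E = rng.normal(size=(vocab_size, dim))  # embedding matrix: one row per word

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Row lookup and one-hot matmul give the same vector; frameworks use
# the lookup because it skips the wasteful multiply-by-zeros
assert np.allclose(E[word_index], one_hot @ E)
```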
Beyond Words
Embedding approach works for anything:
- Characters: For morphology
- Subwords: BPE, SentencePiece
- Sentences: Doc2Vec, Sentence-BERT
- Items: Product embeddings
- Users: User embeddings
- Graphs: Node2Vec
Modern Context: Still Relevant?
With BERT and GPT, do we need Word2Vec?
Yes, for:
- Efficient baselines
- Resource-constrained settings
- Understanding foundations
- Initialization for smaller models
No, when:
- You need contextual understanding
- State-of-the-art performance required
- Computational resources available
Key Takeaways
- Embeddings map words to dense vectors capturing semantics
- Word2Vec learns from predicting context (skip-gram/CBOW)
- Similar words have similar vectors
- Vector arithmetic captures relationships (king - man + woman ≈ queen)
- Static embeddings can't handle polysemy
- Contextual embeddings (BERT) solve this but are more expensive