Word Embeddings
Word embeddings are dense vector representations of words where semantically similar words have similar vectors. They're the foundation of modern NLP.
The Problem with One-Hot Encoding
Traditional approach: represent each word as a sparse vector.
"cat" → [1, 0, 0, 0, 0, ...] (10,000+ dimensions)
"dog" → [0, 1, 0, 0, 0, ...]
"car" → [0, 0, 1, 0, 0, ...]
Problems:
- High dimensional: Vocabulary size = dimensions
- Sparse: Mostly zeros
- No semantic meaning: all one-hot vectors are orthogonal, so "cat" and "dog" are exactly as different as "cat" and "car"
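The orthogonality problem is easy to verify directly. A minimal NumPy sketch with a toy 5-word vocabulary (real vocabularies would be 10,000+ dimensions):

```python
import numpy as np

# Toy one-hot vectors for a 5-word vocabulary
cat = np.array([1, 0, 0, 0, 0])
dog = np.array([0, 1, 0, 0, 0])
car = np.array([0, 0, 1, 0, 0])

# Every pair of distinct one-hot vectors has dot product 0,
# so the encoding carries no notion of similarity at all
print(cat @ dog)  # 0
print(cat @ car)  # 0
```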
The Solution: Dense Embeddings
Map words to low-dimensional dense vectors:
"cat" → [0.2, -0.4, 0.7, 0.1, ...] (300 dimensions)
"dog" → [0.3, -0.3, 0.6, 0.2, ...] (similar to cat!)
"car" → [-0.5, 0.8, -0.2, 0.4, ...] (different)
Similar words → similar vectors.
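Using the toy 4-dimensional vectors from the example above (real embeddings are typically ~300-dimensional), cosine similarity makes "similar vectors" concrete:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, -1 = opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors mirroring the example above
cat = np.array([0.2, -0.4, 0.7, 0.1])
dog = np.array([0.3, -0.3, 0.6, 0.2])
car = np.array([-0.5, 0.8, -0.2, 0.4])

print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # low (negative here)
```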
Word2Vec
The breakthrough model from Google (2013).
Skip-gram
Predict context words from center word:
"The quick brown [fox] jumps over"
Given "fox", predict: "quick", "brown", "jumps", "over"
The training loss pulls each center word's vector toward the vectors of the words that appear in its context.
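Extracting the (center, context) training pairs from a sentence can be sketched in a few lines (window size 2, matching the example above):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over".split()
# Pairs where "fox" is the center word:
fox_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```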
CBOW (Continuous Bag of Words)
Predict center word from context:
Given: "quick", "brown", "jumps", "over"
Predict: "fox"
CBOW trains faster than skip-gram, but skip-gram usually produces better vectors, especially for rare words.
Training Tricks
Negative Sampling: Instead of computing a softmax over the entire vocabulary, sample a handful of negative examples:
- Positive: (fox, brown) → should be similar
- Negative: (fox, random_word) → should be dissimilar
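The negative-sampling objective maximizes log σ(v_center · u_positive) plus Σ log σ(−v_center · u_negative) over the sampled negatives. A minimal sketch of that loss for a single pair, with random toy vectors standing in for learned embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, positive, negatives):
    """Negative-sampling loss for one (center, context) pair.

    Low when center is similar to the true context word and
    dissimilar to each sampled negative word.
    """
    loss = -np.log(sigmoid(center @ positive))
    for neg in negatives:
        loss -= np.log(sigmoid(-center @ neg))
    return loss

# Toy stand-ins: in real training these vectors are the model parameters
rng = np.random.default_rng(0)
fox, brown = rng.normal(size=50), rng.normal(size=50)
negatives = [rng.normal(size=50) for _ in range(5)]
print(neg_sampling_loss(fox, brown, negatives))
```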
Subword Information: FastText extends Word2Vec with character n-grams:
- "where" = "<wh", "whe", "her", "ere", "re>"
- Handles rare words and misspellings
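Extracting the character n-grams with boundary markers is straightforward (trigrams shown; FastText actually uses a range of n-gram lengths):

```python
def char_ngrams(word, n=3):
    """Character n-grams with < > boundary markers, FastText-style."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's FastText vector is then the sum of the vectors of its n-grams, which is why unseen or misspelled words still get a sensible representation.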
GloVe (Global Vectors)
Stanford's alternative approach (2014).
Key insight: Word co-occurrence statistics contain semantic info.
Objective: wᵢ · w̃ⱼ + bᵢ + b̃ⱼ ≈ log(Xᵢⱼ), where Xᵢⱼ is the co-occurrence count and bᵢ, b̃ⱼ are per-word biases, with a weighting function that down-weights rare pairs.
Combines:
- Global statistics (like traditional methods)
- Local context (like neural methods)
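GloVe minimizes a weighted least-squares loss, J = Σ f(Xᵢⱼ)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², where f down-weights rare pairs and caps at 1. A sketch of that objective with tiny random parameters (x_max=100, α=0.75 are the values from the GloVe paper):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrences; caps at 1 for frequent ones
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 8  # tiny vocabulary and embedding size for illustration
X = rng.integers(0, 5, size=(V, V)).astype(float)  # toy co-occurrence counts
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_t, b, b_t, X))  # scalar, >= 0
```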
Famous Properties
Analogies
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
Vector arithmetic captures relationships!
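The analogy trick is just vector arithmetic plus a nearest-neighbor search (excluding the query words themselves). A toy sketch with hand-made 2-d vectors encoding crude "royalty" and "gender" directions, not real learned embeddings:

```python
import numpy as np

def nearest(vec, vocab, exclude=()):
    """Return the word whose vector has highest cosine similarity to vec."""
    best, best_sim = None, -np.inf
    for word, v in vocab.items():
        if word in exclude:
            continue
        sim = vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Dimension 0 ~ "royalty", dimension 1 ~ "gender" (hand-made for illustration)
vocab = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}
result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, vocab, exclude={"king", "man", "woman"}))  # queen
```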
Clustering
Similar words cluster together:
- Animals: dog, cat, horse...
- Countries: France, Germany, Italy...
Similarity
cosine_similarity("happy", "joyful") > cosine_similarity("happy", "sad")
Limitations of Static Embeddings
Polysemy Problem
"bank" has one vector, but multiple meanings:
- River bank
- Financial bank
Solution: Contextual embeddings (BERT, etc.)
Out-of-Vocabulary
No embedding for words not in training vocabulary.
Solution: Subword models (FastText, BPE)
Bias
Embeddings learn societal biases from training data:
- "doctor" closer to "man" than "woman"
- Racial and gender stereotypes encoded
Using Embeddings
Pretrained Embeddings
Download and use directly:
- Word2Vec (Google News, 3M words, 300d)
- GloVe (Wikipedia, various sizes)
- FastText (157 languages)
Fine-tuning
Start with pretrained, update during training:
- Good when you have domain-specific data
- Risk: Catastrophic forgetting of general knowledge
From Scratch
Train embeddings as part of model:
- Learn task-specific representations
- Needs lots of data
Embeddings in Deep Learning
```python
# PyTorch embedding layer
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 300
embedding = nn.Embedding(vocab_size, embedding_dim)

# Input: word indices [batch, seq_len]
word_indices = torch.tensor([[1, 5, 42]])

# Output: embeddings [batch, seq_len, embedding_dim]
vectors = embedding(word_indices)  # shape: [1, 3, 300]
```
An embedding layer is just a lookup table: selecting row i of the embedding matrix is mathematically equivalent to multiplying a one-hot vector by that matrix.
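That equivalence is easy to check directly (NumPy used here to keep the sketch dependency-free):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
E = rng.normal(size=(vocab_size, dim))  # embedding matrix: one row per word

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Row lookup and one-hot matmul give the same vector; frameworks use
# the lookup because it skips the wasteful multiply-by-zeros
assert np.allclose(E[word_index], one_hot @ E)
```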
Beyond Words
Embedding approach works for anything:
- Characters: For morphology
- Subwords: BPE, SentencePiece
- Sentences: Doc2Vec, Sentence-BERT
- Items: Product embeddings
- Users: User embeddings
- Graphs: Node2Vec
Modern Context: Still Relevant?
With BERT and GPT, do we need Word2Vec?
Yes, for:
- Efficient baselines
- Resource-constrained settings
- Understanding foundations
- Initialization for smaller models
No, when:
- You need contextual understanding
- State-of-the-art performance required
- Computational resources available
Key Takeaways
- Embeddings map words to dense vectors capturing semantics
- Word2Vec learns from predicting context (skip-gram/CBOW)
- Similar words have similar vectors
- Vector arithmetic captures relationships (king - man + woman ≈ queen)
- Static embeddings can't handle polysemy
- Contextual embeddings (BERT) solve this but are more expensive