Beginner · Natural Language Processing

Understand word embeddings - dense vector representations that capture semantic meaning, enabling neural networks to process text.

Tags: embeddings, word2vec, glove, representation-learning

Word Embeddings

Word embeddings are dense vector representations of words where semantically similar words have similar vectors. They're the foundation of modern NLP.

The Problem with One-Hot Encoding

Traditional approach: represent each word as a sparse vector.

"cat"  → [1, 0, 0, 0, 0, ...] (10,000+ dimensions)
"dog"  → [0, 1, 0, 0, 0, ...]
"car"  → [0, 0, 1, 0, 0, ...]

Problems:

  • High dimensional: Vocabulary size = dimensions
  • Sparse: Mostly zeros
  • No semantic meaning: every pair of distinct words is equally distant, so "cat" is no closer to "dog" than to "car"
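
The problems above are easy to see in code. A minimal sketch with a hypothetical 5-word vocabulary: every one-hot vector is orthogonal to every other, so no similarity information survives.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one has tens of thousands of words.
vocab = ["cat", "dog", "car", "the", "runs"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse vector: all zeros except a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Every pair of distinct words is equally (un)related:
# the dot product between any two different one-hot vectors is 0.
print(one_hot("cat") @ one_hot("dog"))  # 0.0
print(one_hot("cat") @ one_hot("car"))  # 0.0
```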

The Solution: Dense Embeddings

Map words to low-dimensional dense vectors:

"cat"  → [0.2, -0.4, 0.7, 0.1, ...]  (300 dimensions)
"dog"  → [0.3, -0.3, 0.6, 0.2, ...]  (similar to cat!)
"car"  → [-0.5, 0.8, -0.2, 0.4, ...] (different)

Similar words → similar vectors.
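
"Similar" is usually measured with cosine similarity. A sketch using the illustrative 4-d vectors from above (real embeddings are typically 100-300 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: 1 = same direction, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dense embeddings (values chosen for illustration).
cat = np.array([0.2, -0.4, 0.7, 0.1])
dog = np.array([0.3, -0.3, 0.6, 0.2])
car = np.array([-0.5, 0.8, -0.2, 0.4])

print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # negative (pointing away)
```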

Word2Vec

The breakthrough model from Google (2013).

Skip-gram

Predict context words from center word:

"The quick brown [fox] jumps over"

Given "fox", predict: "quick", "brown", "jumps", "over"

Training pushes the center word's vector toward the vectors of its context words.
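
Generating the (center, context) training pairs is simple. A sketch with a window of 2 (the function name is illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a +/- window around each word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over".split()
# Pairs whose center word is "fox":
fox_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```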

CBOW (Continuous Bag of Words)

Predict center word from context:

Given: "quick", "brown", "jumps", "over"
Predict: "fox"

CBOW trains faster, but skip-gram usually yields better vectors, especially for rare words.
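
A minimal numpy sketch of the CBOW forward pass. The two weight matrices (here random stand-ins named W_in and W_out) mirror word2vec's input and output embeddings; the context vectors are averaged and scored against every vocabulary word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
index = {w: i for i, w in enumerate(vocab)}
dim = 8

# Hypothetical weight matrices: input embeddings for context words,
# output embeddings used to score the prediction.
W_in = rng.normal(size=(len(vocab), dim))
W_out = rng.normal(size=(len(vocab), dim))

def cbow_probs(context):
    """CBOW: average the context embeddings, then softmax over the vocabulary."""
    h = np.mean([W_in[index[w]] for w in context], axis=0)
    logits = W_out @ h                            # one score per vocab word
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

probs = cbow_probs(["quick", "brown", "jumps", "over"])
# Training would push probs[index["fox"]] toward 1.
```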

Training Tricks

Negative Sampling: Instead of softmax over all words, sample negative examples:

  • Positive: (fox, brown) → should be similar
  • Negative: (fox, random_word) → should be dissimilar
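
The negative-sampling loss treats each pair as a binary classification problem. A sketch with random toy vectors (names and dimensions are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    """Logistic loss: the real pair should score high, sampled pairs low."""
    pos = -np.log(sigmoid(center_vec @ context_vec))
    neg = -sum(np.log(sigmoid(-center_vec @ n)) for n in negative_vecs)
    return float(pos + neg)

rng = np.random.default_rng(1)
fox, brown = rng.normal(size=8), rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(5)]  # sampled "noise" words
loss = neg_sampling_loss(fox, brown, negatives)
```

Only 1 positive and a handful of negatives are scored per update, instead of a softmax over the whole vocabulary.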

Subword Information: FastText extends Word2Vec with character n-grams:

  • "where" = "<wh", "whe", "her", "ere", "re>"
  • Handles rare words and misspellings
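
Extracting the character n-grams is a one-liner once the word is padded with boundary markers (shown here for n = 3; FastText actually uses a range of n):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```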

GloVe (Global Vectors)

Stanford's alternative approach (2014).

Key insight: Word co-occurrence statistics contain semantic info.

Objective: wᵢ · wⱼ + bᵢ + bⱼ ≈ log(Xᵢⱼ), where Xᵢⱼ is the co-occurrence count

Combines:

  • Global statistics (like traditional methods)
  • Local context (like neural methods)
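
One term of the GloVe loss can be sketched directly. The full objective also includes per-word bias terms and a weighting function f(X) that down-weights rare co-occurrences and caps frequent ones (x_max = 100, alpha = 0.75 are the values used in the GloVe paper):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X): down-weights rare co-occurrences, caps frequent ones at 1."""
    return min((x / x_max) ** alpha, 1.0)

def glove_term(w_i, w_j, b_i, b_j, x_ij):
    """One summand of the loss: f(X_ij) * (w_i.w_j + b_i + b_j - log X_ij)^2."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
```

Training sums this term over all word pairs with a nonzero co-occurrence count; a perfect fit (dot product plus biases equal to the log count) gives zero loss.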

Famous Properties

Analogies

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Vector arithmetic captures relationships!
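
Solving an analogy is a nearest-neighbor search around the arithmetic result. A sketch with hypothetical 3-d vectors chosen so the gender offset is exact; with real pretrained vectors the pattern holds only approximately:

```python
import numpy as np

# Hypothetical embeddings (illustrative values, not from a trained model).
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.1]),
    "man":   np.array([0.2, 0.9, 0.0]),
    "woman": np.array([0.2, 0.1, 0.0]),
    "car":   np.array([-0.5, 0.3, 0.9]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as standard analogy evaluation does.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("king", "man", "woman"))  # queen
```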

Clustering

Similar words cluster together:

  • Animals: dog, cat, horse...
  • Countries: France, Germany, Italy...

Similarity

cosine_similarity("happy", "joyful") > cosine_similarity("happy", "sad")

Limitations of Static Embeddings

Polysemy Problem

"bank" has one vector, but multiple meanings:

  • River bank
  • Financial bank

Solution: Contextual embeddings (BERT, etc.)

Out-of-Vocabulary

No embedding for words not in training vocabulary.

Solution: Subword models (FastText, BPE)

Bias

Embeddings learn societal biases from training data:

  • "doctor" closer to "man" than "woman"
  • Racial and gender stereotypes encoded

Using Embeddings

Pretrained Embeddings

Download and use directly:

  • Word2Vec (Google News, 3M words, 300d)
  • GloVe (Wikipedia, various sizes)
  • FastText (157 languages)
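
GloVe ships as plain text: one word per line followed by its float components. A sketch of a loader (load_glove and the vocab filter are illustrative helpers, not part of any library):

```python
import numpy as np

def load_glove(path, vocab=None):
    """Parse GloVe's text format; optionally keep only words in `vocab`."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if vocab is None or word in vocab:
                vectors[word] = np.array(values, dtype=np.float32)
    return vectors

# e.g. vectors = load_glove("glove.6B.100d.txt", vocab={"cat", "dog"})
```

Filtering to your task's vocabulary while parsing keeps memory use down, since the full files hold hundreds of thousands of words.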

Fine-tuning

Start with pretrained, update during training:

  • Good when you have domain-specific data
  • Risk: Catastrophic forgetting of general knowledge
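
In PyTorch this is one call to nn.Embedding.from_pretrained; the pretrained matrix below is random as a stand-in for weights you would load from Word2Vec or GloVe files:

```python
import torch
import torch.nn as nn

# Hypothetical pretrained matrix: 1,000-word vocab, 300-d vectors.
pretrained = torch.randn(1000, 300)

# freeze=False lets gradients update the embeddings during fine-tuning;
# freeze=True keeps them fixed as static features.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
```

Freezing for the first few epochs and unfreezing later is a common middle ground against catastrophic forgetting.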

From Scratch

Train embeddings as part of model:

  • Learn task-specific representations
  • Needs lots of data

Embeddings in Deep Learning

# PyTorch embedding layer
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 300
embedding = nn.Embedding(vocab_size, embedding_dim)

# Input: word indices [batch, seq_len]
word_indices = torch.randint(0, vocab_size, (32, 20))

# Output: embeddings [batch, seq_len, embedding_dim]
vectors = embedding(word_indices)  # shape: [32, 20, 300]

An embedding layer is just a lookup table: it selects row i of its weight matrix, which is equivalent to multiplying a one-hot vector by that matrix.
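
The equivalence is easy to verify with a small layer (sizes here are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
indices = torch.tensor([3, 7])

# Direct lookup...
looked_up = embedding(indices)

# ...matches one-hot vectors multiplied by the weight matrix.
one_hot = F.one_hot(indices, num_classes=10).float()
multiplied = one_hot @ embedding.weight

assert torch.allclose(looked_up, multiplied)
```

The lookup is what actually runs: it skips building the mostly-zero one-hot matrix entirely.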

Beyond Words

The embedding approach works for any discrete item:

  • Characters: For morphology
  • Subwords: BPE, SentencePiece
  • Sentences: Doc2Vec, Sentence-BERT
  • Items: Product embeddings
  • Users: User embeddings
  • Graphs: Node2Vec

Modern Context: Still Relevant?

With BERT and GPT, do we need Word2Vec?

Yes, for:

  • Efficient baselines
  • Resource-constrained settings
  • Understanding foundations
  • Initialization for smaller models

No, when:

  • You need contextual understanding
  • State-of-the-art performance required
  • Computational resources available

Key Takeaways

  1. Embeddings map words to dense vectors capturing semantics
  2. Word2Vec learns from predicting context (skip-gram/CBOW)
  3. Similar words have similar vectors
  4. Vector arithmetic captures relationships (king - man + woman ≈ queen)
  5. Static embeddings can't handle polysemy
  6. Contextual embeddings (BERT) solve this but are more expensive