Embeddings
Embeddings are learned dense vector representations of discrete objects. They map categorical data into continuous vector spaces where similar items are nearby.
What Are Embeddings?
From Sparse to Dense
One-hot encoding (sparse):
"cat" → [1, 0, 0, 0, ..., 0] (vocab_size dimensions)
"dog" → [0, 1, 0, 0, ..., 0]
Embedding (dense):
"cat" → [0.2, -0.5, 0.8, 0.1] (embedding_dim dimensions)
"dog" → [0.3, -0.4, 0.7, 0.2] # Similar to cat!
Key Properties
- Dense: Most values non-zero
- Low-dimensional: typically 50-1000 dimensions, vs. a vocabulary-sized one-hot vector
- Learned: Trained from data
- Semantic: Similar items → similar vectors
How Embeddings Work
Embedding Layer
Simply a lookup table:
```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        # One learnable row per vocabulary item
        self.weights = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, indices):
        return self.weights[indices]  # Just a row lookup!
```
Training
Gradients flow back through embedding layer:
Loss → ... → Embedding weights updated
Embeddings learn representations useful for the task.
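As a minimal sketch (PyTorch, with a made-up one-line "loss"), note that only the rows that were actually looked up receive gradients:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
indices = torch.tensor([1, 3])     # only rows 1 and 3 are looked up
loss = emb(indices).sum()          # toy loss, just to drive backprop
loss.backward()

print(emb.weight.grad[1])          # non-zero gradient for a used row
print(emb.weight.grad[0])          # all zeros: row 0 was never used
```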
Types of Embeddings
Word Embeddings
Vector representations of words:
- Word2Vec: Skip-gram, CBOW
- GloVe: Global Vectors
- FastText: Subword embeddings
Contextual Embeddings
Same word, different vectors based on context:
- BERT: Bidirectional context
- GPT: Left-to-right context
"bank" (river) → [0.2, 0.8, ...]
"bank" (money) → [0.9, 0.1, ...]
Item Embeddings
Products, movies, users:
- Recommendation systems
- Learned from interactions
Graph Embeddings
Nodes in networks:
- Node2Vec
- GraphSAGE
Image Embeddings
From CNN encoders:
- ResNet features
- CLIP embeddings
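A small sketch of pulling image embeddings from a pre-trained ResNet with torchvision (the pooled 2048-dimensional features just before the classification head; the input batch here is a random placeholder):

```python
import torch
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
encoder.eval()

images = torch.randn(8, 3, 224, 224)         # placeholder for preprocessed images
with torch.no_grad():
    embeddings = encoder(images).flatten(1)  # [8, 2048]
```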
Vector Arithmetic
Embeddings capture relationships:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Germany") ≈ vec("Berlin")
Similarity Measures
Cosine Similarity
cos_sim(a, b) = (a · b) / (||a|| × ||b||)
Range: -1 to 1. Most common for embeddings.
Euclidean Distance
dist(a, b) = ||a - b||
Smaller = more similar.
Dot Product
a · b = Σ aᵢbᵢ
Fast, but magnitude-sensitive.
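All three measures take a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

a, b = np.array([0.2, -0.5, 0.8]), np.array([0.3, -0.4, 0.7])
print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
```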
Training Embeddings
As Part of Model
```python
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)    # [batch, seq, embed]
        pooled = embedded.mean(dim=1)   # [batch, embed]
        return self.classifier(pooled)
```
Pre-trained Embeddings
```python
# Load pre-trained vectors (e.g. GloVe) into the layer
embedding.weight.data.copy_(torch.tensor(pretrained_vectors))

# Optionally freeze them
embedding.weight.requires_grad = False
```
Contrastive Learning
Train so that similar pairs end up close and dissimilar pairs end up far apart. The triplet form of the loss:
Loss = max(0, distance(anchor, positive) - distance(anchor, negative) + margin)
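A minimal PyTorch sketch using the built-in triplet loss (the anchor/positive/negative tensors here are random placeholders standing in for real embeddings):

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(32, 128, requires_grad=True)  # embeddings of anchor items
positive = torch.randn(32, 128, requires_grad=True)  # embeddings of similar items
negative = torch.randn(32, 128, requires_grad=True)  # embeddings of dissimilar items

loss = triplet_loss(anchor, positive, negative)
loss.backward()
```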
Embedding Dimension
How to Choose
| Application | Typical Dimension |
|---|---|
| Word embeddings | 100-300 |
| Sentence embeddings | 384-768 |
| Recommendation | 32-128 |
| Large language models | 768-4096 |
Rule of Thumb
embed_dim ≈ 4th root of vocab_size, often rounded to a power of 2 for efficiency (64, 128, 256, 512).
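A quick worked example of that heuristic (the catalogue size is made up):

```python
num_items = 500_000                   # e.g. items in a recommendation catalogue
embed_dim = round(num_items ** 0.25)  # ≈ 27
# Rounded up to the next power of 2: 32
```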
Using Embeddings
Semantic Search
```python
import numpy as np

def search(query, documents):
    # embed() is whatever model turns text into a vector (placeholder here);
    # cosine_similarity() is the helper defined above
    query_emb = embed(query)
    doc_embs = [embed(d) for d in documents]
    similarities = [cosine_similarity(query_emb, d) for d in doc_embs]
    return documents[int(np.argmax(similarities))]
```
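For a concrete `embed()`, one common choice is the `sentence-transformers` library (a sketch assuming the `all-MiniLM-L6-v2` model, which produces 384-dimensional vectors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text):
    return model.encode(text)  # NumPy vector of shape [384]
```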
Clustering
```python
from sklearn.cluster import KMeans

# get_embedding() is a placeholder for whatever model produces the vectors
embeddings = [get_embedding(item) for item in items]
kmeans = KMeans(n_clusters=10).fit(embeddings)
cluster_ids = kmeans.labels_  # one cluster id per item
```
Classification Features
```python
import numpy as np

X = np.array([get_embedding(text) for text in texts])  # embeddings as features
model.fit(X, labels)
```
Retrieval (RAG)
```python
# Index documents
index.add(document_embeddings)

# Query
results = index.search(query_embedding, k=5)
```
Common Issues
Out-of-Vocabulary (OOV)
Unknown words have no embedding.
Solutions:
- Use an `<UNK>` token
- Subword tokenization (BPE, WordPiece); see the sketch below
- Character-level embeddings
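For instance, subword tokenizers break an unseen word into known pieces instead of collapsing it to `<UNK>` (a sketch assuming the `bert-base-uncased` WordPiece tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("embeddingology")  # a word unlikely to be in the vocab
print(pieces)  # splits into known subword pieces, each with its own embedding
```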
Cold Start
New items have no learned embedding.
Solutions:
- Content-based initial embedding
- Average of similar items
- Frequent retraining
Embedding Drift
Meaning changes over time.
Solutions:
- Periodic retraining
- Incremental updates
Vector Databases
For efficient similarity search:
- Pinecone: Managed service
- Weaviate: Open source
- FAISS: Facebook's library
- Chroma: Lightweight
- Qdrant: Rust-based
```python
import faiss
import numpy as np

index = faiss.IndexFlatIP(embed_dim)  # inner product (cosine if vectors are L2-normalized)
index.add(np.asarray(embeddings, dtype="float32"))  # shape [n, embed_dim]
query = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
distances, indices = index.search(query, k=10)  # top-10 nearest neighbours
```
Key Takeaways
- Embeddings map discrete items to dense vectors
- Similar items have similar vectors
- Learned end-to-end or pre-trained
- Cosine similarity is standard measure
- Enable semantic search, recommendations, clustering
- Vector databases enable efficient retrieval at scale