Embeddings
Embeddings are learned dense vector representations of discrete objects. They map categorical data into continuous vector spaces where similar items are nearby.
What Are Embeddings?
From Sparse to Dense
One-hot encoding (sparse):
"cat" → [1, 0, 0, 0, ..., 0] (vocab_size dimensions)
"dog" → [0, 1, 0, 0, ..., 0]
Embedding (dense):
"cat" → [0.2, -0.5, 0.8, 0.1] (embedding_dim dimensions)
"dog" → [0.3, -0.4, 0.7, 0.2] # Similar to cat!
Key Properties
- Dense: Most values non-zero
- Low-dimensional: typically 50-1000 dimensions, vs. a vocabulary-sized one-hot vector
- Learned: Trained from data
- Semantic: Similar items → similar vectors
How Embeddings Work
Embedding Layer
Simply a lookup table:
```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim):
        # One learnable row per vocabulary item
        self.weights = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, indices):
        return self.weights[indices]  # Just a row lookup!
```
Training
Gradients flow back through embedding layer:
Loss → ... → Embedding weights updated
Embeddings learn representations useful for the task.
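As a minimal sketch (PyTorch, with a made-up one-line "loss"), note that only the rows that were actually looked up receive gradients:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
indices = torch.tensor([1, 3])     # only rows 1 and 3 are looked up
loss = emb(indices).sum()          # toy loss, just to drive backprop
loss.backward()

print(emb.weight.grad[1])          # non-zero gradient for a used row
print(emb.weight.grad[0])          # all zeros: row 0 was never used
```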
Types of Embeddings
Word Embeddings
Vector representations of words:
- Word2Vec: Skip-gram, CBOW
- GloVe: Global Vectors
- FastText: Subword embeddings
Contextual Embeddings
Same word, different vectors based on context:
- BERT: Bidirectional context
- GPT: Left-to-right context
"bank" (river) → [0.2, 0.8, ...]
"bank" (money) → [0.9, 0.1, ...]
Item Embeddings
Products, movies, users:
- Recommendation systems
- Learned from interactions
Graph Embeddings
Nodes in networks:
- Node2Vec
- GraphSAGE
Image Embeddings
From CNN encoders:
- ResNet features
- CLIP embeddings
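A small sketch of pulling image embeddings from a pre-trained ResNet with torchvision (the pooled 2048-dimensional features just before the classification head; the input batch here is a random placeholder):

```python
import torch
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
encoder.eval()

images = torch.randn(8, 3, 224, 224)         # placeholder for preprocessed images
with torch.no_grad():
    embeddings = encoder(images).flatten(1)  # [8, 2048]
```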
Vector Arithmetic
Embeddings capture relationships:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Germany") ≈ vec("Berlin")
Similarity Measures
Cosine Similarity
cos_sim(a, b) = (a · b) / (||a|| × ||b||)
Range: -1 to 1. Most common for embeddings.
Euclidean Distance
dist(a, b) = ||a - b||
Smaller = more similar.
Dot Product
a · b = Σ aᵢbᵢ
Fast, but magnitude-sensitive.
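All three measures take a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

a, b = np.array([0.2, -0.5, 0.8]), np.array([0.3, -0.4, 0.7])
print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
```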
Training Embeddings
As Part of Model
```python
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)    # [batch, seq, embed]
        pooled = embedded.mean(dim=1)   # [batch, embed]
        return self.classifier(pooled)
```
Pre-trained Embeddings
```python
# Load pre-trained vectors (e.g. GloVe) into the layer
embedding.weight.data.copy_(torch.tensor(pretrained_vectors))

# Optionally freeze them
embedding.weight.requires_grad = False
```
Contrastive Learning
Train so that similar pairs end up close and dissimilar pairs end up far apart. The triplet form of the loss:
Loss = max(0, distance(anchor, positive) - distance(anchor, negative) + margin)
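A minimal PyTorch sketch using the built-in triplet loss (the anchor/positive/negative tensors here are random placeholders standing in for real embeddings):

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(32, 128, requires_grad=True)  # embeddings of anchor items
positive = torch.randn(32, 128, requires_grad=True)  # embeddings of similar items
negative = torch.randn(32, 128, requires_grad=True)  # embeddings of dissimilar items

loss = triplet_loss(anchor, positive, negative)
loss.backward()
```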
Embedding Dimension
How to Choose
| Application | Typical Dimension |
|---|---|
| Word embeddings | 100-300 |
| Sentence embeddings | 384-768 |
| Recommendation | 32-128 |
| Large language models | 768-4096 |
Rule of Thumb
embed_dim ≈ 4th root of vocab_size, often rounded to a power of 2 for efficiency (64, 128, 256, 512).
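A quick worked example of that heuristic (the catalogue size is made up):

```python
num_items = 500_000                   # e.g. items in a recommendation catalogue
embed_dim = round(num_items ** 0.25)  # ≈ 27
# Rounded up to the next power of 2: 32
```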
Using Embeddings
Semantic Search
```python
import numpy as np

def search(query, documents):
    # embed() is whatever model turns text into a vector (placeholder here);
    # cosine_similarity() is the helper defined above
    query_emb = embed(query)
    doc_embs = [embed(d) for d in documents]
    similarities = [cosine_similarity(query_emb, d) for d in doc_embs]
    return documents[int(np.argmax(similarities))]
```
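For a concrete `embed()`, one common choice is the `sentence-transformers` library (a sketch assuming the `all-MiniLM-L6-v2` model, which produces 384-dimensional vectors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text):
    return model.encode(text)  # NumPy vector of shape [384]
```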
Clustering
```python
from sklearn.cluster import KMeans

# get_embedding() is a placeholder for whatever model produces the vectors
embeddings = [get_embedding(item) for item in items]
kmeans = KMeans(n_clusters=10).fit(embeddings)
cluster_ids = kmeans.labels_  # one cluster id per item
```
Classification Features
```python
import numpy as np

X = np.array([get_embedding(text) for text in texts])  # embeddings as features
model.fit(X, labels)
```
Retrieval (RAG)
```python
# Index documents
index.add(document_embeddings)

# Query
results = index.search(query_embedding, k=5)
```
Common Issues
Out-of-Vocabulary (OOV)
Unknown words have no embedding.
Solutions:
- Use an `<UNK>` token
- Subword tokenization (BPE, WordPiece); see the sketch below
- Character-level embeddings
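For instance, subword tokenizers break an unseen word into known pieces instead of collapsing it to `<UNK>` (a sketch assuming the `bert-base-uncased` WordPiece tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("embeddingology")  # a word unlikely to be in the vocab
print(pieces)  # splits into known subword pieces, each with its own embedding
```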
Cold Start
New items have no learned embedding.
Solutions:
- Content-based initial embedding
- Average of similar items
- Frequent retraining
Embedding Drift
Meaning changes over time.
Solutions:
- Periodic retraining
- Incremental updates
Vector Databases
For efficient similarity search:
- Pinecone: Managed service
- Weaviate: Open source
- FAISS: Facebook's library
- Chroma: Lightweight
- Qdrant: Rust-based
```python
import faiss
import numpy as np

index = faiss.IndexFlatIP(embed_dim)  # inner product (cosine if vectors are L2-normalized)
index.add(np.asarray(embeddings, dtype="float32"))  # shape [n, embed_dim]
query = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
distances, indices = index.search(query, k=10)  # top-10 nearest neighbours
```
Key Takeaways
- Embeddings map discrete items to dense vectors
- Similar items have similar vectors
- Learned end-to-end or pre-trained
- Cosine similarity is standard measure
- Enable semantic search, recommendations, clustering
- Vector databases enable efficient retrieval at scale