# Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors, providing a similarity score between -1 and 1 that is independent of vector magnitude.
## Definition

```
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
                        = Σᵢ(Aᵢ × Bᵢ) / (√Σᵢ(Aᵢ²) × √Σᵢ(Bᵢ²))
```
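A quick worked example with small vectors (the same pair reappears in the NumPy snippet below):

```
A = [1, 2, 3],  B = [4, 5, 6]
A · B  = 1×4 + 2×5 + 3×6 = 32
||A||  = √(1² + 2² + 3²) = √14 ≈ 3.742
||B||  = √(4² + 5² + 6²) = √77 ≈ 8.775
cos(θ) = 32 / (3.742 × 8.775) ≈ 0.9746
```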
## Interpretation

- cos(θ) = 1: vectors point in the same direction (identical orientation)
- cos(θ) = 0: vectors are orthogonal (no similarity)
- cos(θ) = -1: vectors point in opposite directions
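These three cases are easy to verify numerically; the 2-D vectors below are arbitrary illustrations:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cos_sim(a, np.array([2.0, 0.0])))   #  1.0 (same direction, different length)
print(cos_sim(a, np.array([0.0, 3.0])))   #  0.0 (orthogonal)
print(cos_sim(a, np.array([-1.0, 0.0])))  # -1.0 (opposite direction)
```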
## Visual Intuition

```
        B
       ↗
      ╱
     ╱
    ╱ θ
   ●────────→ A
```

cos(θ) = similarity:

- θ = 0°: cos(0°) = 1 (same direction)
- θ = 90°: cos(90°) = 0 (perpendicular)
- θ = 180°: cos(180°) = -1 (opposite)
## Implementation

### NumPy

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the L2 norms
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Example
vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 5, 6])
sim = cosine_similarity(vec1, vec2)
print(f"Similarity: {sim:.4f}")  # 0.9746
```
### Scikit-learn

```python
from sklearn.metrics.pairwise import cosine_similarity

# For matrices: pairwise similarity between all rows
X = np.array([[1, 2, 3], [4, 5, 6], [1, 0, 0]])
sim_matrix = cosine_similarity(X)
print(sim_matrix)
# [[1.   0.97 0.27]
#  [0.97 1.   0.46]
#  [0.27 0.46 1.  ]]
```
### PyTorch

```python
import torch
import torch.nn.functional as F

vec1 = torch.tensor([1., 2., 3.])
vec2 = torch.tensor([4., 5., 6.])

# Method 1: F.cosine_similarity (operates along dim=1, hence the unsqueeze)
sim = F.cosine_similarity(vec1.unsqueeze(0), vec2.unsqueeze(0))

# Method 2: manual
sim = torch.dot(vec1, vec2) / (vec1.norm() * vec2.norm())
```
## Cosine vs Euclidean Distance

```
      B                B' (B scaled up)
      ↗                  ↗
     ╱                  ╱
    ╱                  ╱
   ●────→ A           ●──────────→ A' (A scaled up)
```

Cosine: sim(A, B) = sim(A, B') = sim(A', B')
(direction matters, not magnitude: scaling a vector never changes its cosine similarity to anything)

Euclidean: dist(A, B) ≠ dist(A, B')
(magnitude matters)
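A small NumPy check of this difference (the vectors are arbitrary examples):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[3.0, 0.0]])
B = np.array([[1.0, 1.0]])
B_scaled = 10 * B  # same direction, 10x the magnitude

# Cosine similarity is unchanged by scaling: both print ≈ 0.7071
print(cosine_similarity(A, B), cosine_similarity(A, B_scaled))

# Euclidean distance changes with scaling: √5 ≈ 2.24 vs √149 ≈ 12.21
print(np.linalg.norm(A - B), np.linalg.norm(A - B_scaled))
```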
### When to Use Each

| Metric | Best For |
|---|---|
| Cosine | Text similarity, embeddings, high-dimensional data |
| Euclidean | Spatial data, clustering, cases where magnitude matters |
## Applications

### 1. Document Similarity

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "I love pizza"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Similarity between the first two documents
sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print(f"Doc 1-2 similarity: {sim[0][0]:.4f}")  # High (shared vocabulary)

sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[2:3])
print(f"Doc 1-3 similarity: {sim[0][0]:.4f}")  # Low (no shared terms)
```
### 2. Semantic Search with Embeddings

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "How to train a neural network",
    "Deep learning tutorial for beginners",
    "Best pizza recipes"
]
query = "machine learning training guide"

# Encode all texts into dense embeddings
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Find the most similar document
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
best_match = documents[similarities.argmax()]
print(f"Best match: {best_match}")  # Deep learning tutorial...
```
### 3. Recommendation Systems

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item matrix: rows are users, columns are item ratings (0 = unrated)
user_ratings = np.array([
    [5, 4, 0, 0, 1],  # User 1
    [4, 5, 0, 0, 2],  # User 2
    [0, 0, 5, 4, 5],  # User 3
])

# Pairwise similarity between users
user_similarities = cosine_similarity(user_ratings)
print("User similarities:")
print(user_similarities)
# Users 1 and 2 are similar (both like items 1-2);
# User 3 is different (likes items 3-5)
```
### 4. Image Similarity

```python
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Extract features with a pre-trained model
# (newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT)
resnet = models.resnet50(pretrained=True)
resnet = torch.nn.Sequential(*list(resnet.children())[:-1])  # Remove classifier head
resnet.eval()

def get_image_embedding(image_path):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    image = Image.open(image_path).convert('RGB')
    tensor = transform(image).unsqueeze(0)  # Add a batch dimension
    with torch.no_grad():
        embedding = resnet(tensor).squeeze()  # 2048-dim feature vector
    return embedding

# Compare two images
emb1 = get_image_embedding('cat1.jpg')
emb2 = get_image_embedding('cat2.jpg')
similarity = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))
```
## Cosine Distance

cosine_distance = 1 - cosine_similarity

Range: [0, 2]

- 0: identical
- 1: orthogonal
- 2: opposite

```python
from sklearn.metrics.pairwise import cosine_distances

distances = cosine_distances(X)
# distances[i,j] = 1 - cosine_similarity(X[i], X[j])
```
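Because many clustering algorithms expect a distance rather than a similarity, cosine distance plugs in directly. A minimal sketch with scikit-learn's AgglomerativeClustering, reusing the X matrix from above (note: the keyword is metric in scikit-learn >= 1.2, affinity in older releases):

```python
from sklearn.cluster import AgglomerativeClustering

# Cluster rows by direction rather than magnitude
clustering = AgglomerativeClustering(n_clusters=2, metric='cosine', linkage='average')
labels = clustering.fit_predict(X)
print(labels)  # [1, 2, 3] and [4, 5, 6] share a cluster; [1, 0, 0] stands apart
```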
## Soft Cosine Similarity

Accounts for word-level similarity in document comparison, e.g. when words like "car" and "automobile" should be considered similar. It uses word embeddings to build a term-similarity matrix that replaces the implicit identity matrix of plain cosine similarity.

```python
# gensim < 4.0 exposed a direct helper:
from gensim.matutils import softcossim
from gensim import corpora
# (softcossim was removed in gensim 4.0; see the sketch below)
```
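A minimal sketch of the same idea with the gensim >= 4 API; the glove-wiki-gigaword-50 model and the toy sentences are illustrative choices, not requirements:

```python
import gensim.downloader as api
from gensim import corpora
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

# Toy documents: plain cosine only credits the shared stopwords,
# while soft cosine also credits "car" ≈ "automobile" and "fast" ≈ "quick"
docs = [["the", "car", "is", "fast"], ["the", "automobile", "is", "quick"]]
dictionary = corpora.Dictionary(docs)
bow1, bow2 = (dictionary.doc2bow(d) for d in docs)

# Term-similarity matrix built from pre-trained word embeddings
word_vectors = api.load('glove-wiki-gigaword-50')  # downloads on first use
termsim_index = WordEmbeddingSimilarityIndex(word_vectors)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

# Soft cosine similarity between the two bag-of-words vectors
print(termsim_matrix.inner_product(bow1, bow2, normalized=(True, True)))
```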
## Normalized Vectors

For unit vectors (L2 normalized), cosine similarity equals the dot product:

```python
# Normalize to unit length
vec1_norm = vec1 / np.linalg.norm(vec1)
vec2_norm = vec2 / np.linalg.norm(vec2)

# Now the dot product IS the cosine similarity
similarity = np.dot(vec1_norm, vec2_norm)  # Faster: no norms to recompute

# Many embedding models already return normalized vectors,
# e.g. CLIP, or sentence-transformers with normalize_embeddings=True
```
## Batch Computation

```python
# Efficient batch cosine similarity via a single matrix multiplication
def batch_cosine_similarity(queries, documents):
    # Normalize each row to unit length
    queries_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    docs_norm = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    # One matmul computes all query-document pairs
    similarities = queries_norm @ docs_norm.T
    return similarities

# [n_queries, n_docs] similarity matrix
sims = batch_cosine_similarity(query_embeddings, doc_embeddings)
```
## Key Takeaways
- Cosine similarity measures angle between vectors (ignores magnitude)
- Range: -1 (opposite) to 1 (identical), 0 = orthogonal
- Widely used for text, embeddings, and high-dimensional data
- For normalized vectors, cosine similarity = dot product
- Use cosine distance (1 - similarity) for clustering algorithms
- Efficient for sparse data (TF-IDF) and dense embeddings