Learn about cosine similarity, a metric for measuring the similarity between two vectors, widely used in NLP, recommendation systems, and information retrieval.

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, providing a similarity score between -1 and 1 that is independent of vector magnitude.

Definition

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

                        = Σᵢ(Aᵢ × Bᵢ) / (√Σᵢ(Aᵢ²) × √Σᵢ(Bᵢ²))
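
A quick worked example, using the same vectors that appear in the code below, A = (1, 2, 3) and B = (4, 5, 6):

A · B = 1×4 + 2×5 + 3×6 = 32
||A|| = √(1² + 2² + 3²) = √14 ≈ 3.742
||B|| = √(4² + 5² + 6²) = √77 ≈ 8.775

cosine_similarity(A, B) = 32 / (3.742 × 8.775) ≈ 0.9746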

Interpretation

cos(θ) = 1:   Vectors point same direction (identical)
cos(θ) = 0:   Vectors are orthogonal (no similarity)
cos(θ) = -1:  Vectors point opposite directions
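
These three cases are easy to check numerically. A minimal NumPy sketch (cos_sim is a throwaway helper; a fuller implementation appears in the Implementation section below):

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1., 0.])
print(cos_sim(a, np.array([2., 0.])))   # 1.0  (same direction, length ignored)
print(cos_sim(a, np.array([0., 3.])))   # 0.0  (orthogonal)
print(cos_sim(a, np.array([-1., 0.])))  # -1.0 (opposite)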

Visual Intuition

        B
        ↑
       ╱│
      ╱ │
     ╱  │
    ╱ θ │
   ╱────┘
  A

cos(θ) = similarity

θ = 0°:   cos(0°) = 1    (same direction)
θ = 90°:  cos(90°) = 0   (perpendicular)
θ = 180°: cos(180°) = -1 (opposite)

Implementation

NumPy

import numpy as np

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)  # undefined (division by zero) if either vector is all zeros

# Example
vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 5, 6])
sim = cosine_similarity(vec1, vec2)
print(f"Similarity: {sim:.4f}")  # 0.9746

Scikit-learn

from sklearn.metrics.pairwise import cosine_similarity

# For matrices (pairwise similarity)
X = np.array([[1, 2, 3], [4, 5, 6], [1, 0, 0]])
sim_matrix = cosine_similarity(X)
print(sim_matrix)
# [[1.    0.97  0.27]
#  [0.97  1.    0.46]
#  [0.27  0.46  1.  ]]

PyTorch

import torch
import torch.nn.functional as F

vec1 = torch.tensor([1., 2., 3.])
vec2 = torch.tensor([4., 5., 6.])

# Method 1: F.cosine_similarity
sim = F.cosine_similarity(vec1.unsqueeze(0), vec2.unsqueeze(0))

# Method 2: Manual
sim = torch.dot(vec1, vec2) / (vec1.norm() * vec2.norm())

Cosine vs Euclidean Distance

         ○ B (short vector)      ○ B' (scaled B)
        ╱                       ╱
       ╱                       ╱
      ╱                       ╱
     ╱                       ╱
    ○─────────────○─────────○
    A            A'          

Cosine: sim(A,B) = sim(A,B') = sim(A',B) = sim(A',B')
        (direction matters, not magnitude: scaling A or B leaves the angle unchanged)

Euclidean: dist(A,B) ≠ dist(A,B')
           (magnitude matters)
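
A small sketch of the difference, using toy vectors chosen for illustration: scaling b by 10 leaves the cosine untouched but changes the Euclidean distance dramatically.

import numpy as np

a = np.array([1., 1.])
b = np.array([3., 4.])
b_scaled = 10 * b   # same direction, 10x the magnitude

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(a, b), cos_sim(a, b_scaled))                  # 0.9899 0.9899
print(np.linalg.norm(a - b), np.linalg.norm(a - b_scaled))  # 3.61 48.60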

When to Use Each

Metric      Best For
Cosine      Text similarity, embeddings, high-dimensional data
Euclidean   Spatial data, clustering, when magnitude matters

Applications

1. Document Similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "I love pizza"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Similarity between first two documents
sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
print(f"Doc 1-2 similarity: {sim[0][0]:.4f}")  # High

sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[2:3])
print(f"Doc 1-3 similarity: {sim[0][0]:.4f}")  # Low

2. Semantic Search with Embeddings

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "How to train a neural network",
    "Deep learning tutorial for beginners",
    "Best pizza recipes"
]
query = "machine learning training guide"

# Encode all texts
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Find most similar
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
best_match = documents[similarities.argmax()]
print(f"Best match: {best_match}")  # Deep learning tutorial...

3. Recommendation Systems

# User-item similarity based on ratings
user_ratings = np.array([
    [5, 4, 0, 0, 1],  # User 1
    [4, 5, 0, 0, 2],  # User 2
    [0, 0, 5, 4, 5],  # User 3
])

# Find similar users
user_similarities = cosine_similarity(user_ratings)
print("User similarities:")
print(user_similarities)
# User 1 and 2 are similar (both like items 1,2)
# User 3 is different (likes items 3,4,5)
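
A common next step is a nearest-neighbour lookup on this matrix. A minimal sketch that continues from the arrays above:

# Most similar other user, per user
sims = user_similarities.copy()
np.fill_diagonal(sims, -1)   # exclude self-similarity on the diagonal
nearest = sims.argmax(axis=1)
print(nearest)               # [1 0 1] -> users 1 and 2 are mutual neighbours
# In practice you would then aggregate each neighbour's ratings
# for items the target user has not rated yet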

4. Image Similarity

import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Extract features with pre-trained model
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # 'pretrained=True' in older torchvision
resnet = torch.nn.Sequential(*list(resnet.children())[:-1])  # Remove classifier
resnet.eval()

def get_image_embedding(image_path):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    image = Image.open(image_path).convert('RGB')
    tensor = transform(image).unsqueeze(0)
    with torch.no_grad():
        embedding = resnet(tensor).squeeze()
    return embedding

# Compare images
emb1 = get_image_embedding('cat1.jpg')
emb2 = get_image_embedding('cat2.jpg')
similarity = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))

Cosine Distance

cosine_distance = 1 - cosine_similarity

Range: [0, 2]
  0: identical
  1: orthogonal
  2: opposite

from sklearn.metrics.pairwise import cosine_distances

distances = cosine_distances(X)
# distances[i,j] = 1 - cosine_similarity(X[i], X[j])
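
SciPy provides the same quantity for a single pair of vectors; note that scipy.spatial.distance.cosine returns the distance, not the similarity:

from scipy.spatial.distance import cosine

dist = cosine([1, 2, 3], [4, 5, 6])   # cosine distance
sim = 1 - dist                        # ≈ 0.9746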

Soft Cosine Similarity

Accounts for word similarity in document comparison:

# When words like "car" and "automobile" should be considered similar,
# soft cosine weighs term pairs by a word-level similarity matrix,
# typically built from word embeddings.
# gensim 3.x exposed this as gensim.matutils.softcossim;
# gensim 4.x replaced it with the classes below
from gensim import corpora
from gensim.similarities import SparseTermSimilarityMatrix, SoftCosineSimilarity

Normalized Vectors

For unit vectors (L2 normalized), cosine similarity = dot product:

# Normalize vectors
vec1_norm = vec1 / np.linalg.norm(vec1)
vec2_norm = vec2 / np.linalg.norm(vec2)

# Now dot product = cosine similarity
similarity = np.dot(vec1_norm, vec2_norm)  # Faster!

# Many embedding models return normalized vectors
# CLIP, sentence-transformers with normalize_embeddings=True
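
With sentence-transformers, the encode call can normalize for you, after which plain matrix products are cosine similarities (a short sketch, reusing the model from the semantic search example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(["hello world", "hi there"], normalize_embeddings=True)
sims = emb @ emb.T   # dot products of unit vectors = cosine similarities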

Batch Computation

# Efficient batch cosine similarity
def batch_cosine_similarity(queries, documents):
    # Normalize
    queries_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    docs_norm = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    
    # Matrix multiplication for all pairs
    similarities = queries_norm @ docs_norm.T
    return similarities

# [n_queries, n_docs] similarity matrix
sims = batch_cosine_similarity(query_embeddings, doc_embeddings)
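
From there, top-k retrieval is just a sort over each row (illustrative; reuses the sims matrix from above):

top_k = 3
top_idx = np.argsort(-sims, axis=1)[:, :top_k]   # k best documents per query, best first
# top_idx[i] lists the indices of the documents most similar to query i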

Key Takeaways

  1. Cosine similarity measures angle between vectors (ignores magnitude)
  2. Range: -1 (opposite) to 1 (identical), 0 = orthogonal
  3. Widely used for text, embeddings, and high-dimensional data
  4. For normalized vectors, cosine similarity = dot product
  5. Use cosine distance (1 - similarity) when an algorithm expects a distance (e.g., clustering)
  6. Efficient for sparse data (TF-IDF) and dense embeddings