K-Nearest Neighbors (KNN)
K-Nearest Neighbors is one of the simplest and most intuitive machine learning algorithms. It makes predictions based on the principle that similar things exist in close proximity.
The Algorithm
To predict for a new point:
- Calculate distance to all training points
- Find the k nearest neighbors
- Aggregate their labels/values
```
New point (?)          Training data
     ↓                  ● ● ■ ●
    [?]  →  Find k=3 nearest:  ● ● ■
     ↓
Majority vote:  ●
```
- Classification: Majority vote among k neighbors
- Regression: Average (or weighted average) of k neighbors
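The steps above can be sketched in a few lines of plain Python. This is a minimal brute-force classifier, not a production implementation; the function name `knn_classify` and the toy data are illustrative:

```python
from collections import Counter
from math import dist  # Euclidean distance between two point tuples

def knn_classify(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Step 1: rank all training points by distance to the query
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    # Step 2: keep the k closest; Step 3: majority vote over their labels
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

# Toy data: two well-separated clusters
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_X, train_y, (2, 2), k=3))  # → A (nearest cluster)
```

For regression, replace the majority vote with the mean of the k neighbors' values.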
Distance Metrics
The choice of distance measure matters!
Euclidean Distance (L2)
d(x, y) = √(Σ(xᵢ - yᵢ)²)
The straight-line distance. Most common.
Manhattan Distance (L1)
d(x, y) = Σ|xᵢ - yᵢ|
Sum of absolute differences. Often more robust than Euclidean in high dimensions.
Minkowski Distance
d(x, y) = (Σ|xᵢ - yᵢ|^p)^(1/p)
Generalization: p=1 is Manhattan, p=2 is Euclidean.
Cosine Similarity
sim(x, y) = (x · y) / (||x|| × ||y||)
Measures the angle between vectors, ignoring magnitude. Great for text/sparse data. Note it is a similarity, not a distance: for KNN, use 1 − sim(x, y) as the distance.
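The four metrics translate directly into code. A stdlib-only sketch (function names are my own):

```python
from math import sqrt

def euclidean(x, y):
    # L2: straight-line distance
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalizes both: p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    # Angle-based: 1.0 means same direction, regardless of magnitude
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm

print(euclidean((0, 0), (3, 4)))          # → 5.0
print(manhattan((0, 0), (3, 4)))          # → 7
print(minkowski((0, 0), (3, 4), p=2))     # → 5.0 (same as Euclidean)
print(cosine_similarity((1, 0), (2, 0)))  # → 1.0 (same direction)
```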
The Importance of k
k is the key hyperparameter:
| k value | Behavior |
|---|---|
| k = 1 | Most flexible, highest variance, sensitive to noise |
| k = small | Complex decision boundary, potential overfitting |
| k = large | Smoother boundary, potential underfitting |
| k = n | Always predicts the overall majority class (classification) |
Rule of thumb: Start with k = √n (square root of training size).
Tip: Use odd k for binary classification to avoid ties.
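One way to see the bias-variance trade-off concretely is to score several values of k with leave-one-out evaluation: predict each point from all the others. A small sketch on toy data (helper names `knn_vote` and `loo_accuracy` are mine):

```python
from collections import Counter
from math import dist

def knn_vote(train, query, k):
    # Majority vote among the k training points nearest to `query`
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out: predict each point from all the remaining points."""
    hits = sum(
        knn_vote(data[:i] + data[i + 1:], x, k) == y
        for i, (x, y) in enumerate(data)
    )
    return hits / len(data)

# Two balanced clusters of four points each
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]

for k in (1, 3, 5, 7):  # odd k avoids ties in binary classification
    print(k, loo_accuracy(data, k))
```

On this data, small k scores perfectly, while k = 7 (nearly all of n) always outvotes the held-out point's own class and scores 0: the large-k underfitting failure from the table above.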
Weighted KNN
Not all neighbors are equally relevant. Weight by distance:
prediction = Σ wᵢ × yᵢ / Σ wᵢ
where wᵢ = 1/d(x, xᵢ) or wᵢ = 1/d(x, xᵢ)²
Closer neighbors have more influence.
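The weighted formula above, applied to regression, might look like this minimal sketch (the `eps` guard against a zero distance is my own addition):

```python
from math import dist

def weighted_knn_regress(train_X, train_y, query, k=3, eps=1e-9):
    """Distance-weighted average of the k nearest targets (w_i = 1/d_i)."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    # eps avoids division by zero when the query equals a training point
    weights = [1 / (dist(x, query) + eps) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

X = [(0,), (1,), (2,), (10,)]
y = [0.0, 1.0, 2.0, 10.0]
# Prediction is pulled toward the closest neighbor, (1,) -> 1.0
print(weighted_knn_regress(X, y, (1.2,), k=3))
```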
Feature Scaling is Critical
KNN uses distances, so feature scales matter enormously!
Problem: If age is 0-100 and salary is 0-1,000,000, salary dominates.
Solution: Always normalize features:
- Min-max scaling: (x - min) / (max - min)
- Standardization: (x - μ) / σ
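Both scalers are one-liners; here is a stdlib-only sketch showing how they tame the age/salary example:

```python
def min_max_scale(values):
    # Maps values linearly onto [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Zero mean, unit variance (population standard deviation)
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [20, 40, 60]
salaries = [30_000, 60_000, 900_000]
print(min_max_scale(ages))      # → [0.0, 0.5, 1.0]
print(min_max_scale(salaries))  # both features now live on the same [0, 1] scale
```

After scaling, a distance computed over both features no longer lets salary drown out age. In practice, fit the scaler on the training set only and reuse those min/max (or μ/σ) values for queries.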
The Curse of Dimensionality
KNN suffers badly in high dimensions:
- Distances lose meaning: all points become nearly equidistant from each other
- Sparse data: Need exponentially more data to cover space
- Computation explodes: More dimensions = slower
Solutions:
- Dimensionality reduction (PCA)
- Feature selection
- Use for low-dimensional problems
Advantages
- Simple: Easy to understand and implement
- No training: Just store the data
- Naturally handles multi-class: No modification needed
- Non-parametric: No assumptions about data distribution
- Adapts to new data: Just add points
Disadvantages
- Slow prediction: Must scan all training data
- Memory intensive: Stores entire dataset
- Curse of dimensionality: Fails in high dimensions
- Sensitive to scale: Must normalize features
- Sensitive to noise: Outliers affect predictions
Making KNN Faster
KD-Trees
Partition space for faster neighbor search.
- O(log n) search in low dimensions
- Degrades in high dimensions
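A toy KD-tree fits in a few dozen lines. This is a pedagogical sketch (median split, single-nearest-neighbor query, dict-based nodes), not the balanced, optimized structures real libraries use:

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Depth-first search, pruning branches that cannot hold a closer point."""
    if node is None:
        return best
    if best is None or dist(node["point"], query) < dist(best, query):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie on the other side
    if abs(diff) < dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # → (8, 1), same answer as brute force
```

The pruning step is what buys the speedup: in low dimensions most branches get skipped, while in high dimensions the pruning radius rarely excludes anything and the search degrades toward brute force.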
Ball Trees
Alternative structure, better for high dimensions.
Approximate Nearest Neighbors
Trade accuracy for speed.
- Libraries: Annoy, FAISS, ScaNN
- Used in production recommendation systems
When to Use KNN
Good for:
- Low-dimensional data
- When you need a quick baseline
- Small datasets
- When similar examples should have similar labels
Avoid for:
- High-dimensional data (>20 features)
- Large datasets (slow)
- When features are on different scales (unless you scale)
- Real-time predictions on big data
Key Takeaways
- KNN predicts based on the k most similar examples
- Always scale your features!
- Choose k to balance bias and variance
- Fails in high dimensions (curse of dimensionality)
- Simple but can be slow for large datasets