K-Nearest Neighbors (KNN)
K-Nearest Neighbors is one of the simplest and most intuitive machine learning algorithms. It makes predictions based on the principle that similar things exist in close proximity.
The Algorithm
To predict for a new point:
- Calculate distance to all training points
- Find the k nearest neighbors
- Aggregate their labels/values
```
New point (?)          Training data
     ↓                  ● ● ■ ●
    [?]  →  Find k=3 nearest:  ● ● ■
     ↓
Majority vote:  ●
```
- Classification: Majority vote among k neighbors
- Regression: Average (or weighted average) of k neighbors
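The steps above can be sketched in a few lines of plain Python. This is a minimal brute-force classifier, not a production implementation; the function name `knn_classify` and the toy data are illustrative:

```python
from collections import Counter
from math import dist  # Euclidean distance between two point tuples

def knn_classify(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Step 1: rank all training points by distance to the query
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    # Step 2: keep the k closest; Step 3: majority vote over their labels
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

# Toy data: two well-separated clusters
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_X, train_y, (2, 2), k=3))  # → A (nearest cluster)
```

For regression, replace the majority vote with the mean of the k neighbors' values.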
Distance Metrics
The choice of distance measure matters!
Euclidean Distance (L2)
d(x, y) = √(Σ(xᵢ - yᵢ)²)
The straight-line distance. Most common.
Manhattan Distance (L1)
d(x, y) = Σ|xᵢ - yᵢ|
Sum of absolute differences. Often more robust than Euclidean in high dimensions.
Minkowski Distance
d(x, y) = (Σ|xᵢ - yᵢ|^p)^(1/p)
Generalization: p=1 is Manhattan, p=2 is Euclidean.
Cosine Similarity
sim(x, y) = (x · y) / (||x|| × ||y||)
Measures the angle between vectors, ignoring magnitude. Great for text/sparse data. Note it is a similarity, not a distance: for KNN, use 1 − sim(x, y) as the distance.
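The four metrics translate directly into code. A stdlib-only sketch (function names are my own):

```python
from math import sqrt

def euclidean(x, y):
    # L2: straight-line distance
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalizes both: p=1 is Manhattan, p=2 is Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    # Angle-based: 1.0 means same direction, regardless of magnitude
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm

print(euclidean((0, 0), (3, 4)))          # → 5.0
print(manhattan((0, 0), (3, 4)))          # → 7
print(minkowski((0, 0), (3, 4), p=2))     # → 5.0 (same as Euclidean)
print(cosine_similarity((1, 0), (2, 0)))  # → 1.0 (same direction)
```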
The Importance of k
k is the key hyperparameter:
| k value | Behavior |
|---|---|
| k = 1 | Most flexible, highest variance, sensitive to noise |
| k = small | Complex decision boundary, potential overfitting |
| k = large | Smoother boundary, potential underfitting |
| k = n | Always predicts the overall majority class (classification) |
Rule of thumb: Start with k = √n (square root of training size).
Tip: Use odd k for binary classification to avoid ties.
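One way to see the bias-variance trade-off concretely is to score several values of k with leave-one-out evaluation: predict each point from all the others. A small sketch on toy data (helper names `knn_vote` and `loo_accuracy` are mine):

```python
from collections import Counter
from math import dist

def knn_vote(train, query, k):
    # Majority vote among the k training points nearest to `query`
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out: predict each point from all the remaining points."""
    hits = sum(
        knn_vote(data[:i] + data[i + 1:], x, k) == y
        for i, (x, y) in enumerate(data)
    )
    return hits / len(data)

# Two balanced clusters of four points each
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]

for k in (1, 3, 5, 7):  # odd k avoids ties in binary classification
    print(k, loo_accuracy(data, k))
```

On this data, small k scores perfectly, while k = 7 (nearly all of n) always outvotes the held-out point's own class and scores 0: the large-k underfitting failure from the table above.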
Weighted KNN
Not all neighbors are equally relevant. Weight by distance:
prediction = Σ wᵢ × yᵢ / Σ wᵢ
where wᵢ = 1/d(x, xᵢ) or wᵢ = 1/d(x, xᵢ)²
Closer neighbors have more influence.
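The weighted formula above, applied to regression, might look like this minimal sketch (the `eps` guard against a zero distance is my own addition):

```python
from math import dist

def weighted_knn_regress(train_X, train_y, query, k=3, eps=1e-9):
    """Distance-weighted average of the k nearest targets (w_i = 1/d_i)."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    # eps avoids division by zero when the query equals a training point
    weights = [1 / (dist(x, query) + eps) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

X = [(0,), (1,), (2,), (10,)]
y = [0.0, 1.0, 2.0, 10.0]
# Prediction is pulled toward the closest neighbor, (1,) -> 1.0
print(weighted_knn_regress(X, y, (1.2,), k=3))
```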
Feature Scaling is Critical
KNN uses distances, so feature scales matter enormously!
Problem: If age is 0-100 and salary is 0-1,000,000, salary dominates.
Solution: Always normalize features:
- Min-max scaling: (x - min) / (max - min)
- Standardization: (x - μ) / σ
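Both scalers are one-liners; here is a stdlib-only sketch showing how they tame the age/salary example:

```python
def min_max_scale(values):
    # Maps values linearly onto [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Zero mean, unit variance (population standard deviation)
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [20, 40, 60]
salaries = [30_000, 60_000, 900_000]
print(min_max_scale(ages))      # → [0.0, 0.5, 1.0]
print(min_max_scale(salaries))  # both features now live on the same [0, 1] scale
```

After scaling, a distance computed over both features no longer lets salary drown out age. In practice, fit the scaler on the training set only and reuse those min/max (or μ/σ) values for queries.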
The Curse of Dimensionality
KNN suffers badly in high dimensions:
- Distances lose meaning: all points become nearly equidistant from each other
- Sparse data: Need exponentially more data to cover space
- Computation explodes: More dimensions = slower
Solutions:
- Dimensionality reduction (PCA)
- Feature selection
- Use for low-dimensional problems
Advantages
- Simple: Easy to understand and implement
- No training: Just store the data
- Naturally handles multi-class: No modification needed
- Non-parametric: No assumptions about data distribution
- Adapts to new data: Just add points
Disadvantages
- Slow prediction: Must scan all training data
- Memory intensive: Stores entire dataset
- Curse of dimensionality: Fails in high dimensions
- Sensitive to scale: Must normalize features
- Sensitive to noise: Outliers affect predictions
Making KNN Faster
KD-Trees
Partition space for faster neighbor search.
- O(log n) search in low dimensions
- Degrades in high dimensions
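A toy KD-tree fits in a few dozen lines. This is a pedagogical sketch (median split, single-nearest-neighbor query, dict-based nodes), not the balanced, optimized structures real libraries use:

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Depth-first search, pruning branches that cannot hold a closer point."""
    if node is None:
        return best
    if best is None or dist(node["point"], query) < dist(best, query):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie on the other side
    if abs(diff) < dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # → (8, 1), same answer as brute force
```

The pruning step is what buys the speedup: in low dimensions most branches get skipped, while in high dimensions the pruning radius rarely excludes anything and the search degrades toward brute force.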
Ball Trees
Alternative structure, better for high dimensions.
Approximate Nearest Neighbors
Trade accuracy for speed.
- Libraries: Annoy, FAISS, ScaNN
- Used in production recommendation systems
When to Use KNN
Good for:
- Low-dimensional data
- When you need a quick baseline
- Small datasets
- When similar examples should have similar labels
Avoid for:
- High-dimensional data (>20 features)
- Large datasets (slow)
- When features are on different scales (unless you scale)
- Real-time predictions on big data
Key Takeaways
- KNN predicts based on the k most similar examples
- Always scale your features!
- Choose k to balance bias and variance
- Fails in high dimensions (curse of dimensionality)
- Simple but can be slow for large datasets