Support Vector Machines
Support Vector Machines (SVMs) are powerful classifiers that find the optimal hyperplane separating different classes. They're known for working well in high-dimensional spaces and for the famous "kernel trick," which lets them learn non-linear boundaries.
The Core Idea
Find the hyperplane that maximizes the margin - the distance to the nearest points from each class.
Class A (●)                    Class B (■)

  ●           |  margin  |           ■
    ●         |←————————→|         ■
  ●           |          |             ■
      ●       |          |           ■
  support →   |——————————|   ← support
  vectors     |hyperplane|     vectors
The points closest to the boundary are called support vectors.
Why Maximize the Margin?
- Better generalization: Larger margin = more robust to noise
- Unique solution: Only one maximum margin hyperplane
- Sparse solution: Decision depends only on support vectors
Mathematical Formulation
Hard Margin SVM
For linearly separable data:
Maximize: margin = 2/||w|| (equivalently, minimize ½||w||²)
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Where w is the normal to the hyperplane and b is the bias.
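To make the constraint concrete, here is a small check on a hypothetical 2-D hard-margin setup: the hyperplane, points, and labels below are made up for illustration, chosen so every point satisfies yᵢ(w·xᵢ + b) ≥ 1 and the margin is exactly 2/||w||.

```python
import math

# Hypothetical toy setup: separating hyperplane x1 = 0, with
# normal w = (2, 0) and bias b = 0 (not from any real dataset).
w = (2.0, 0.0)
b = 0.0

# Points at x1 = ±0.5 sit exactly on the margin boundaries.
data = [((0.5, 1.0), +1), ((0.5, -1.0), +1),
        ((-0.5, 1.0), -1), ((-0.5, 0.0), -1)]

def decision(x):
    return w[0] * x[0] + w[1] * x[1] + b

# Hard-margin constraint: y_i (w·x_i + b) >= 1 for every point.
assert all(y * decision(x) >= 1 for x, y in data)

# Margin width is 2 / ||w||.
margin = 2.0 / math.hypot(*w)
print(margin)  # 1.0
```

Note that scaling w and b up would shrink 2/||w||, which is why the optimizer prefers the smallest w that still satisfies all the constraints.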
Soft Margin SVM
Real data isn't perfectly separable. Allow some misclassification:
Minimize: ½||w||² + C Σ ξᵢ
Subject to: yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ
Where:
- ξᵢ (slack variables) measure how far each point violates the margin
- C controls the tradeoff between margin and errors
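The slack variables have a closed form for a fixed hyperplane: ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)), the hinge loss. This sketch (with made-up points and a fixed, hypothetical w) shows the three regimes: ξ = 0 outside the margin, 0 < ξ < 1 inside the margin band, and ξ > 1 for a misclassified point.

```python
# Hypothetical fixed hyperplane; in a real SVM, w and b are learned.
w, b = (1.0, 0.0), 0.0
C = 1.0

data = [((2.0, 0.0), +1),   # well outside the margin: xi = 0
        ((0.5, 0.0), +1),   # inside the margin band: 0 < xi < 1
        ((-0.5, 0.0), +1)]  # misclassified: xi > 1

def slack(x, y):
    # Hinge loss: xi_i = max(0, 1 - y_i (w·x_i + b))
    return max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b))

slacks = [slack(x, y) for x, y in data]

# Soft-margin objective: ½||w||² + C Σ ξᵢ
objective = 0.5 * (w[0] ** 2 + w[1] ** 2) + C * sum(slacks)
print(slacks)     # [0.0, 0.5, 1.5]
print(objective)  # 2.5
```

Raising C makes each unit of slack more expensive, pushing the optimizer toward fewer violations at the cost of a narrower margin.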
The Kernel Trick
What if data isn't linearly separable? Map to higher dimensions where it is!
Problem: Computing coordinates in a very high-dimensional space is expensive.
Solution: The kernel trick. The SVM optimization and decision function only ever need dot products between points, and:
K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
A kernel function computes the dot product in high-dimensional space without actually transforming the data.
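The identity is easy to verify for a small case. For the degree-2 polynomial kernel K(x, y) = (x·y)² in 2-D, the explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel value matches the dot product in that 3-D space:

```python
import math

def phi(x):
    # Explicit feature map for K(x, y) = (x·y)^2 in 2-D.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def kernel(x, y):
    # Same quantity computed directly, never leaving 2-D.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(kernel(x, y), explicit)  # both ≈ 121.0
```

The kernel side costs one 2-D dot product and a square; the explicit side required building 3-D vectors first. For the RBF kernel, the explicit side would be infinite-dimensional, so the kernel form is the only practical one.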
Common Kernels
Linear Kernel
K(x, y) = x · y
No transformation. Use when data is linearly separable.
Polynomial Kernel
K(x, y) = (γ × x·y + r)^d
Captures feature interactions up to degree d.
RBF (Gaussian) Kernel
K(x, y) = exp(-γ × ||x - y||²)
The most popular choice. Corresponds to an infinite-dimensional feature space. A good default for most problems.
Sigmoid Kernel
K(x, y) = tanh(γ × x·y + r)
Similar to neural network activation.
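All four kernels are one-liners. This sketch implements them for plain Python lists, with the hyperparameter names (γ as `gamma`, r, d) matching the formulas above; the default values are arbitrary illustrations, not recommendations.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear(x, y):
    return dot(x, y)

def polynomial(x, y, gamma=1.0, r=1.0, d=3):
    # K(x, y) = (γ x·y + r)^d
    return (gamma * dot(x, y) + r) ** d

def rbf(x, y, gamma=0.5):
    # K(x, y) = exp(-γ ||x - y||²)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid(x, y, gamma=0.01, r=0.0):
    # K(x, y) = tanh(γ x·y + r)
    return math.tanh(gamma * dot(x, y) + r)

x = [1.0, 2.0]
print(linear(x, x))  # 5.0
print(rbf(x, x))     # 1.0 — RBF of any point with itself is always 1
```

A useful sanity check: every valid kernel matrix K(xᵢ, xⱼ) must be symmetric and positive semi-definite (Mercer's condition); the sigmoid kernel only satisfies this for some parameter values.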
Key Hyperparameters
C (Regularization)
- Low C: Wider margin, more misclassifications allowed (underfitting)
- High C: Narrower margin, fewer errors (overfitting)
γ (Gamma) for RBF
- Low γ: Smooth decision boundary, far-reaching influence
- High γ: Complex boundary, local influence (overfitting)
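The "reach" of γ can be seen directly from the RBF formula: the kernel value at a fixed distance from a support vector drops off much faster when γ is large. The distances and γ values below are arbitrary, chosen only to illustrate the contrast.

```python
import math

def rbf_at(dist, gamma):
    # RBF kernel value at a given distance from a support vector.
    return math.exp(-gamma * dist ** 2)

# Influence at distance 2 from a support vector:
print(rbf_at(2.0, 0.1))   # ≈ 0.67 — low γ: far-reaching influence
print(rbf_at(2.0, 10.0))  # ≈ 4e-18 — high γ: strictly local influence
```

With high γ, each support vector only influences a tiny neighborhood, so the boundary can wrap around individual training points, which is exactly the overfitting risk noted above.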
Advantages
- Effective in high dimensions: Even when dimensions > samples
- Memory efficient: Only stores support vectors
- Versatile: Different kernels for different problems
- Strong theory: Maximum margin has theoretical guarantees
Disadvantages
- Doesn't scale well: Training is O(n²) to O(n³) in the number of samples
- Sensitive to feature scaling: Must normalize features
- No probability outputs: Requires calibration (Platt scaling)
- Kernel selection: Often requires trial and error
- Hard to interpret: Kernel space is abstract
SVM vs. Other Methods
| Aspect | SVM | Logistic Regression | Neural Networks |
|---|---|---|---|
| Non-linearity | Kernels | Feature engineering | Learned |
| Scalability | Poor | Good | Moderate |
| Interpretability | Low (with kernels) | High | Low |
| Probability output | No (needs calibration) | Yes | Yes |
When to Use SVMs
Good for:
- Small to medium datasets
- High-dimensional data (text classification)
- When margin interpretation matters
- Binary classification
Consider alternatives:
- Large datasets: Too slow
- Need probabilities: Logistic regression or calibrated SVM
- Images/sequences: Deep learning
Practical Tips
- Always scale features: SVMs are sensitive to scale
- Start with RBF kernel: Usually works well
- Grid search C and γ: Try powers of 10
- Consider LinearSVC for large data: Much faster
- Use probability=True if needed: Adds Platt scaling
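The tips above fit together naturally in one pipeline. A minimal sketch, assuming scikit-learn is installed; `make_moons` is just a convenient synthetic two-class dataset, and the C/γ grids are the powers-of-ten starting point suggested above, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linearly-separable data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Tip 1: always scale. Tip 2: start with RBF.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tip 3: grid search C and gamma over powers of 10.
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],
                     "svc__gamma": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Putting the scaler inside the pipeline matters: it ensures each cross-validation fold is scaled using only its own training split, avoiding leakage into the validation scores.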
Key Takeaways
- SVMs find the maximum margin hyperplane
- Support vectors are the critical training points
- Kernel trick enables non-linear classification
- RBF kernel is the default choice
- Trade off margin (C) and kernel complexity (γ)