Support Vector Machines
Support Vector Machines (SVMs) are powerful classifiers that find the optimal hyperplane separating different classes. They're known for working well in high-dimensional spaces and for the famous "kernel trick," which lets them learn non-linear boundaries.
The Core Idea
Find the hyperplane that maximizes the margin - the distance to the nearest points from each class.
Class A (●)                    Class B (■)

  ●           |  margin  |           ■
    ●         |←————————→|         ■
  ●           |          |             ■
      ●       |          |           ■
  support →   |——————————|   ← support
  vectors     |hyperplane|     vectors
The points closest to the boundary are called support vectors.
Why Maximize the Margin?
- Better generalization: Larger margin = more robust to noise
- Unique solution: Only one maximum margin hyperplane
- Sparse solution: Decision depends only on support vectors
Mathematical Formulation
Hard Margin SVM
For linearly separable data:
Maximize: margin = 2/||w|| (equivalently, minimize ½||w||²)
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Where w is the normal to the hyperplane and b is the bias.
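To make the constraint concrete, here is a small check on a hypothetical 2-D hard-margin setup: the hyperplane, points, and labels below are made up for illustration, chosen so every point satisfies yᵢ(w·xᵢ + b) ≥ 1 and the margin is exactly 2/||w||.

```python
import math

# Hypothetical toy setup: separating hyperplane x1 = 0, with
# normal w = (2, 0) and bias b = 0 (not from any real dataset).
w = (2.0, 0.0)
b = 0.0

# Points at x1 = ±0.5 sit exactly on the margin boundaries.
data = [((0.5, 1.0), +1), ((0.5, -1.0), +1),
        ((-0.5, 1.0), -1), ((-0.5, 0.0), -1)]

def decision(x):
    return w[0] * x[0] + w[1] * x[1] + b

# Hard-margin constraint: y_i (w·x_i + b) >= 1 for every point.
assert all(y * decision(x) >= 1 for x, y in data)

# Margin width is 2 / ||w||.
margin = 2.0 / math.hypot(*w)
print(margin)  # 1.0
```

Note that scaling w and b up would shrink 2/||w||, which is why the optimizer prefers the smallest w that still satisfies all the constraints.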
Soft Margin SVM
Real data isn't perfectly separable. Allow some misclassification:
Minimize: ½||w||² + C Σ ξᵢ
Subject to: yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ
Where:
- ξᵢ (slack variables) measure how far each point violates the margin
- C controls the tradeoff between margin and errors
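The slack variables have a closed form for a fixed hyperplane: ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)), the hinge loss. This sketch (with made-up points and a fixed, hypothetical w) shows the three regimes: ξ = 0 outside the margin, 0 < ξ < 1 inside the margin band, and ξ > 1 for a misclassified point.

```python
# Hypothetical fixed hyperplane; in a real SVM, w and b are learned.
w, b = (1.0, 0.0), 0.0
C = 1.0

data = [((2.0, 0.0), +1),   # well outside the margin: xi = 0
        ((0.5, 0.0), +1),   # inside the margin band: 0 < xi < 1
        ((-0.5, 0.0), +1)]  # misclassified: xi > 1

def slack(x, y):
    # Hinge loss: xi_i = max(0, 1 - y_i (w·x_i + b))
    return max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b))

slacks = [slack(x, y) for x, y in data]

# Soft-margin objective: ½||w||² + C Σ ξᵢ
objective = 0.5 * (w[0] ** 2 + w[1] ** 2) + C * sum(slacks)
print(slacks)     # [0.0, 0.5, 1.5]
print(objective)  # 2.5
```

Raising C makes each unit of slack more expensive, pushing the optimizer toward fewer violations at the cost of a narrower margin.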
The Kernel Trick
What if data isn't linearly separable? Map to higher dimensions where it is!
Problem: Computing coordinates in a very high-dimensional space is expensive.
Solution: The kernel trick. The SVM optimization and decision function only ever need dot products between points, and:
K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
A kernel function computes the dot product in high-dimensional space without actually transforming the data.
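The identity is easy to verify for a small case. For the degree-2 polynomial kernel K(x, y) = (x·y)² in 2-D, the explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel value matches the dot product in that 3-D space:

```python
import math

def phi(x):
    # Explicit feature map for K(x, y) = (x·y)^2 in 2-D.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def kernel(x, y):
    # Same quantity computed directly, never leaving 2-D.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(kernel(x, y), explicit)  # both ≈ 121.0
```

The kernel side costs one 2-D dot product and a square; the explicit side required building 3-D vectors first. For the RBF kernel, the explicit side would be infinite-dimensional, so the kernel form is the only practical one.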
Common Kernels
Linear Kernel
K(x, y) = x · y
No transformation. Use when data is linearly separable.
Polynomial Kernel
K(x, y) = (γ × x·y + r)^d
Captures feature interactions up to degree d.
RBF (Gaussian) Kernel
K(x, y) = exp(-γ × ||x - y||²)
The most popular choice. Corresponds to an infinite-dimensional feature space. A good default for most problems.
Sigmoid Kernel
K(x, y) = tanh(γ × x·y + r)
Similar to neural network activation.
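All four kernels are one-liners. This sketch implements them for plain Python lists, with the hyperparameter names (γ as `gamma`, r, d) matching the formulas above; the default values are arbitrary illustrations, not recommendations.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear(x, y):
    return dot(x, y)

def polynomial(x, y, gamma=1.0, r=1.0, d=3):
    # K(x, y) = (γ x·y + r)^d
    return (gamma * dot(x, y) + r) ** d

def rbf(x, y, gamma=0.5):
    # K(x, y) = exp(-γ ||x - y||²)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid(x, y, gamma=0.01, r=0.0):
    # K(x, y) = tanh(γ x·y + r)
    return math.tanh(gamma * dot(x, y) + r)

x = [1.0, 2.0]
print(linear(x, x))  # 5.0
print(rbf(x, x))     # 1.0 — RBF of any point with itself is always 1
```

A useful sanity check: every valid kernel matrix K(xᵢ, xⱼ) must be symmetric and positive semi-definite (Mercer's condition); the sigmoid kernel only satisfies this for some parameter values.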
Key Hyperparameters
C (Regularization)
- Low C: Wider margin, more misclassifications allowed (underfitting)
- High C: Narrower margin, fewer errors (overfitting)
γ (Gamma) for RBF
- Low γ: Smooth decision boundary, far-reaching influence
- High γ: Complex boundary, local influence (overfitting)
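The "reach" of γ can be seen directly from the RBF formula: the kernel value at a fixed distance from a support vector drops off much faster when γ is large. The distances and γ values below are arbitrary, chosen only to illustrate the contrast.

```python
import math

def rbf_at(dist, gamma):
    # RBF kernel value at a given distance from a support vector.
    return math.exp(-gamma * dist ** 2)

# Influence at distance 2 from a support vector:
print(rbf_at(2.0, 0.1))   # ≈ 0.67 — low γ: far-reaching influence
print(rbf_at(2.0, 10.0))  # ≈ 4e-18 — high γ: strictly local influence
```

With high γ, each support vector only influences a tiny neighborhood, so the boundary can wrap around individual training points, which is exactly the overfitting risk noted above.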
Advantages
- Effective in high dimensions: Even when dimensions > samples
- Memory efficient: Only stores support vectors
- Versatile: Different kernels for different problems
- Strong theory: Maximum margin has theoretical guarantees
Disadvantages
- Doesn't scale well: Training is O(n²) to O(n³) in the number of samples
- Sensitive to feature scaling: Must normalize features
- No probability outputs: Requires calibration (Platt scaling)
- Kernel selection: Often requires trial and error
- Hard to interpret: Kernel space is abstract
SVM vs. Other Methods
| Aspect | SVM | Logistic Regression | Neural Networks |
|---|---|---|---|
| Non-linearity | Kernels | Feature engineering | Learned |
| Scalability | Poor | Good | Moderate |
| Interpretability | Low (with kernels) | High | Low |
| Probability output | No (needs calibration) | Yes | Yes |
When to Use SVMs
Good for:
- Small to medium datasets
- High-dimensional data (text classification)
- When margin interpretation matters
- Binary classification
Consider alternatives:
- Large datasets: Too slow
- Need probabilities: Logistic regression or calibrated SVM
- Images/sequences: Deep learning
Practical Tips
- Always scale features: SVMs are sensitive to scale
- Start with RBF kernel: Usually works well
- Grid search C and γ: Try powers of 10
- Consider LinearSVC for large data: Much faster
- Use probability=True if needed: Adds Platt scaling
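The tips above fit together naturally in one pipeline. A minimal sketch, assuming scikit-learn is installed; `make_moons` is just a convenient synthetic two-class dataset, and the C/γ grids are the powers-of-ten starting point suggested above, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic non-linearly-separable data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Tip 1: always scale. Tip 2: start with RBF.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tip 3: grid search C and gamma over powers of 10.
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],
                     "svc__gamma": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Putting the scaler inside the pipeline matters: it ensures each cross-validation fold is scaled using only its own training split, avoiding leakage into the validation scores.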
Key Takeaways
- SVMs find the maximum margin hyperplane
- Support vectors are the critical training points
- Kernel trick enables non-linear classification
- RBF kernel is the default choice
- Trade off margin (C) and kernel complexity (γ)