Intermediate · Classical Machine Learning

Learn about Support Vector Machines (SVMs) - powerful classifiers that find optimal decision boundaries using the kernel trick.

Tags: classification, kernel-methods, supervised-learning

Support Vector Machines

Support Vector Machines (SVMs) are powerful classifiers that find the optimal hyperplane separating different classes. They are known for working well in high-dimensional spaces and for the famous "kernel trick," which enables non-linear classification.

The Core Idea

Find the hyperplane that maximizes the margin - the distance from the hyperplane to the nearest training points of each class.

      Class A (●)                 Class B (■)
         ●                            ■
           ●     |  margin  |      ■
         ●       |←————————→|    ■
           ●     |          |      ■
              ●  |          |  ■
        support→ |——————————| ←support
        vectors  | hyperplane|  vectors

The points closest to the boundary are called support vectors.

Why Maximize the Margin?

  1. Better generalization: Larger margin = more robust to noise
  2. Unique solution: Only one maximum margin hyperplane
  3. Sparse solution: Decision depends only on support vectors
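The sparsity property can be seen directly in code. A minimal sketch, assuming scikit-learn and NumPy are installed (the cluster centers and the large C value are illustrative choices to approximate a hard-margin fit):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),   # Class A
               rng.normal(+2, 0.5, (20, 2))])  # Class B
y = np.array([0] * 20 + [1] * 20)

# A linear SVM with a very large C approximates the hard-margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the points on the margin are stored as support vectors.
print(len(clf.support_vectors_), "support vectors out of", len(X), "points")
```

Because the decision function depends only on these few stored points, prediction cost does not grow with the full training set size.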

Mathematical Formulation

Hard Margin SVM

For linearly separable data:

Maximize: margin = 2/||w||  (equivalently, minimize ½||w||²)
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i

Where w is the normal vector to the hyperplane and b is the bias (intercept).

Soft Margin SVM

Real data is rarely perfectly separable, so we allow some points to violate the margin:

Minimize: ½||w||² + C Σ ξᵢ
Subject to: yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ

Where:

  • ξᵢ (slack variables) measure how far each point violates the margin; ξᵢ > 1 means the point is misclassified
  • C controls the tradeoff between a wide margin and training errors
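The role of C can be demonstrated empirically. A sketch assuming scikit-learn (the overlapping clusters are synthetic, and the two C values are arbitrary endpoints): lowering C admits more slack, so more points end up inside the margin and become support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: not perfectly separable, so slack is required.
X = np.vstack([rng.normal(-1, 1.0, (50, 2)),
               rng.normal(+1, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Lower C -> wider margin -> more points inside it -> more support vectors.
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```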

The Kernel Trick

What if data isn't linearly separable? Map to higher dimensions where it is!

Problem: Computing in high dimensions is expensive.

Solution: The kernel trick. Both training and prediction depend on the data only through dot products, so every dot product can be replaced by a kernel function:

K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)

A kernel computes the dot product in the high-dimensional space without ever constructing φ(x) explicitly.
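This identity can be checked by hand for the degree-2 polynomial kernel K(x, y) = (x·y)² in 2-D, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²) (a standard textbook example; the input vectors below are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input vector."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = np.dot(x, y) ** 2          # kernel evaluated in the original 2-D space
rhs = np.dot(phi(x), phi(y))     # dot product after explicit mapping to 3-D

print(lhs, rhs)  # both equal 16.0
```

The left side costs one 2-D dot product; the right side requires building 3-D vectors first. For high-degree polynomials or the RBF kernel, the explicit space is enormous or infinite, which is why evaluating K directly is the only practical option.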

Common Kernels

Linear Kernel

K(x, y) = x · y

No transformation. Use when data is linearly separable.

Polynomial Kernel

K(x, y) = (γ × x·y + r)^d

Captures feature interactions up to degree d.

RBF (Gaussian) Kernel

K(x, y) = exp(-γ × ||x - y||²)

The most popular choice. Implicitly maps into an infinite-dimensional space and works well on most problems.

Sigmoid Kernel

K(x, y) = tanh(γ × x·y + r)

Resembles a neural network activation function. Note that it is not a valid (positive semi-definite) kernel for all choices of γ and r.
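A quick comparison of these kernels on a dataset that is not linearly separable (a sketch using scikit-learn's make_moons; training accuracy is shown only to illustrate expressiveness, not generalization):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: no straight line separates them.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel}: training accuracy {clf.score(X, y):.2f}")
```

The linear kernel plateaus well below 100% because no hyperplane in the original 2-D space can separate the moons; the RBF kernel fits the curved boundary easily.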

Key Hyperparameters

C (Regularization)

  • Low C: Wider margin, more misclassifications tolerated (risk of underfitting)
  • High C: Narrower margin, fewer training errors (risk of overfitting)

γ (Gamma) for RBF

  • Low γ: Smooth decision boundary; each training point's influence reaches far (risk of underfitting)
  • High γ: Complex boundary; each point's influence is local (risk of overfitting)
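The γ effect can be illustrated as follows (a sketch; the γ values and the noisy-moons dataset are arbitrary). Very high γ drives training accuracy toward 100% even on noisy labels, a classic overfitting signature:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy data, so a smooth boundary cannot fit every training point.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # High gamma memorizes the training set, including the noise.
    print(f"gamma={gamma}: train accuracy {clf.score(X, y):.2f}, "
          f"{clf.n_support_.sum()} support vectors")
```

Rising training accuracy with rising γ says nothing about test accuracy; in practice both C and γ should be chosen by cross-validation.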

Advantages

  1. Effective in high dimensions: Even when dimensions > samples
  2. Memory efficient: Only stores support vectors
  3. Versatile: Different kernels for different problems
  4. Strong theory: Maximum margin has theoretical guarantees

Disadvantages

  1. Doesn't scale well: O(n²) to O(n³) training
  2. Sensitive to feature scaling: Must normalize features
  3. No probability outputs: Requires calibration (Platt scaling)
  4. Kernel selection: Often requires trial and error
  5. Hard to interpret: Kernel space is abstract
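Disadvantage 3 is visible directly in scikit-learn's API: SVC.decision_function returns signed distances to the hyperplane, and probability estimates require opting in to Platt scaling via probability=True. A sketch (the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# decision_function gives signed margins, not probabilities.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.decision_function(X[:2]))

# probability=True fits Platt scaling internally (extra cross-validation cost).
clf_p = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
print(clf_p.predict_proba(X[:2]))  # one row per sample, rows sum to 1
```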

SVM vs. Other Methods

Aspect               SVM                      Logistic Regression    Neural Networks
Non-linearity        Kernels                  Feature engineering    Learned
Scalability          Poor                     Good                   Moderate
Interpretability     Low (with kernels)       High                   Low
Probability output   No (needs calibration)   Yes                    Yes

When to Use SVMs

Good for:

  • Small to medium datasets
  • High-dimensional data (text classification)
  • When margin interpretation matters
  • Binary classification

Consider alternatives:

  • Large datasets: Too slow
  • Need probabilities: Logistic regression or calibrated SVM
  • Images/sequences: Deep learning

Practical Tips

  1. Always scale features: SVMs are sensitive to scale
  2. Start with RBF kernel: Usually works well
  3. Grid search C and γ: Try powers of 10
  4. Consider LinearSVC for large data: Much faster
  5. Use probability=True if needed: Adds Platt scaling
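Tips 1-3 combine naturally into a single scikit-learn workflow (a sketch; the dataset and parameter grid are illustrative). Scaling inside a Pipeline ensures each cross-validation fold is scaled independently, avoiding leakage:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tip 1: scale features; doing it inside the pipeline keeps CV honest.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # Tip 2: RBF default

# Tip 3: grid-search C and gamma over powers of 10.
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print("test accuracy:", round(grid.score(X_te, y_te), 3))
```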

Key Takeaways

  1. SVMs find the maximum margin hyperplane
  2. Support vectors are the critical training points
  3. Kernel trick enables non-linear classification
  4. RBF kernel is the default choice
  5. Trade off margin (C) and kernel complexity (γ)
