Beginner · Classical Machine Learning

Learn about regularization techniques that prevent overfitting by constraining model complexity, including L1, L2, and modern methods.

Tags: regularization, overfitting, l1, l2, model-selection

Regularization

Regularization is the art of preventing overfitting by adding constraints to the learning process. It's one of the most important concepts in machine learning.

[Figure: overfitting vs. good fit, illustrated with training and validation curves]

Why Regularize?

Models can fit training data too well, capturing noise instead of signal. Regularization adds a penalty for complexity:

Total Loss = Training Loss + λ × Complexity Penalty

Where λ controls the regularization strength.
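This trade-off is easy to see numerically. A minimal sketch (with made-up predictions and an L2 penalty as the complexity term):

```python
import numpy as np

# Illustrative only: same training loss, different complexity penalties,
# assuming squared-error training loss and lam = 0.1.
def total_loss(y_true, y_pred, weights, lam=0.1):
    training_loss = np.mean((y_true - y_pred) ** 2)  # fit to the data
    complexity_penalty = np.sum(weights ** 2)        # L2 complexity term
    return training_loss + lam * complexity_penalty

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

small_w = np.array([0.5, -0.3])
large_w = np.array([5.0, -3.0])

print(total_loss(y_true, y_pred, small_w))  # penalty barely matters
print(total_loss(y_true, y_pred, large_w))  # penalty dominates the loss
```

With identical predictions, the large-weight model pays a much higher total loss, so the optimizer is pushed toward simpler solutions.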

L2 Regularization (Ridge)

The Penalty

Penalty = λ × Σ wᵢ²

Adds the squared magnitude of weights to the loss.

Effect

  • Shrinks weights toward zero (but not exactly to zero)
  • Larger weights penalized more heavily
  • Weights distributed across features

Mathematical View

Equivalent to:

  • Constraining ||w||² ≤ t
  • Assuming Gaussian prior on weights (Bayesian view)

When to Use

  • When most features are useful
  • When you don't need feature selection
  • Default choice for many problems
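Ridge has a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy, which makes the shrinkage effect easy to demonstrate. A minimal sketch on synthetic data (the data and λ values are illustrative):

```python
import numpy as np

# Ridge regression via its closed-form solution on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

def ridge_fit(X, y, lam):
    n_features = X.shape[1]
    # Solve (X^T X + lam * I) w = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_weak = ridge_fit(X, y, lam=0.01)
w_strong = ridge_fit(X, y, lam=100.0)

# Stronger regularization shrinks the weights toward zero,
# but none of them become exactly zero.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

Increasing λ shrinks the weight vector's norm, but every coefficient stays nonzero, which is exactly the "shrinks but doesn't zero" behavior described above.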

L1 Regularization (Lasso)

The Penalty

Penalty = λ × Σ |wᵢ|

Adds the absolute value of weights to the loss.

Effect

  • Drives some weights exactly to zero
  • Automatic feature selection
  • Produces sparse models

Why Sparsity?

The L1 constraint region is a diamond in weight space (a cross-polytope in higher dimensions). The contours of the loss tend to first touch this region at a corner, and at a corner some weights are exactly zero.

When to Use

  • When you suspect many features are irrelevant
  • Want interpretable sparse models
  • Feature selection is important
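The mechanism that produces exact zeros can be seen in the proximal step for the L1 penalty, which is soft-thresholding. A minimal sketch (the weights and threshold are illustrative):

```python
import numpy as np

# Soft-thresholding: the update that L1-regularized solvers (e.g. coordinate
# descent, ISTA) apply to weights. Anything with |w| <= threshold snaps to 0.
def soft_threshold(w, threshold):
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

w = np.array([3.0, -0.2, 0.05, -1.5, 0.4])
print(soft_threshold(w, 0.5))
# small entries (-0.2, 0.05, 0.4) become exactly zero;
# large entries (3.0, -1.5) only shrink by the threshold
```

This is why L1 yields sparse models: small coefficients don't just get smaller, they get eliminated.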

L1 vs L2 Comparison

Aspect              | L1 (Lasso)        | L2 (Ridge)
--------------------|-------------------|---------------
Penalty             | Σ |wᵢ|            | Σ wᵢ²
Sparsity            | Yes               | No
Feature selection   | Automatic         | No
Correlated features | Picks one         | Spreads weight
Stability           | Less stable       | More stable
Solution            | Not always unique | Unique

Elastic Net

Combines L1 and L2:

Penalty = α × L1 + (1-α) × L2

Gets sparsity from L1 and stability from L2.

Best of both worlds for correlated features.
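A minimal sketch of the combined penalty, following the α-weighted formula above (note that some libraries, e.g. sklearn's ElasticNet, parameterize the mix differently, via l1_ratio):

```python
import numpy as np

# Elastic net penalty: alpha = 1 is pure L1 (lasso), alpha = 0 is pure L2 (ridge).
def elastic_net_penalty(w, lam, alpha):
    l1 = np.sum(np.abs(w))   # sum of absolute values
    l2 = np.sum(w ** 2)      # sum of squares
    return lam * (alpha * l1 + (1 - alpha) * l2)

w = np.array([1.0, -2.0, 0.5])
print(elastic_net_penalty(w, lam=0.1, alpha=0.5))  # even mix of L1 and L2
```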

Regularization in Different Models

Linear/Logistic Regression

  • Ridge, Lasso, or Elastic Net
  • Controlled by λ (alpha in sklearn)

Decision Trees

  • Max depth
  • Min samples per leaf
  • Pruning

Neural Networks

  • Weight decay (L2 on weights)
  • Dropout
  • Batch normalization (implicit)
  • Early stopping

SVMs

  • C parameter (inverse of λ)
  • Lower C = more regularization

Modern Regularization Techniques

Dropout

Randomly zero out neurons during training.

  • Prevents co-adaptation
  • Ensemble effect
  • See: Dropout concept

Batch Normalization

Normalizes layer inputs.

  • Implicit regularization effect
  • Allows higher learning rates

Data Augmentation

Create modified versions of training data.

  • Rotations, flips, crops for images
  • Paraphrasing for text
  • Implicitly adds invariances

Early Stopping

Stop training before convergence.

  • Monitor validation loss
  • Implicit regularization
  • Simple but effective
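The monitoring logic is just a patience counter. A minimal sketch, with a made-up sequence of validation losses standing in for a real training run:

```python
# Early stopping with a patience counter; val_losses is an illustrative
# stand-in for per-epoch validation losses from a real training loop.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56]

patience = 2          # stop after this many epochs without improvement
best_loss = float("inf")
epochs_without_improvement = 0
best_epoch = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss
        best_epoch = epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss stopped improving: stop training

print(best_epoch, best_loss)  # restore the checkpoint from the best epoch
```

In practice you would save model weights whenever `best_loss` improves and restore that checkpoint at the end.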

Choosing Regularization Strength

Cross-Validation

The gold standard:

  1. Try several λ values
  2. Evaluate each with CV
  3. Pick λ with best validation score
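The three steps above can be sketched with a manual k-fold loop and closed-form ridge (synthetic data and an illustrative λ grid; in practice you would typically reach for sklearn's GridSearchCV or RidgeCV):

```python
import numpy as np

# Choosing lambda by k-fold cross-validation on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
w_true = np.zeros(8)
w_true[:3] = [1.5, -2.0, 1.0]
y = X @ w_true + 0.5 * rng.normal(size=120)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for fold in folds:
        train = np.ones(len(y), dtype=bool)
        train[fold] = False                       # hold out this fold
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return np.mean(errors)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]   # step 1: candidate values
scores = {lam: cv_error(X, y, lam) for lam in lambdas}  # step 2: CV each
best_lam = min(scores, key=scores.get)    # step 3: best validation score
print(best_lam, scores[best_lam])
```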

Learning Curves

Plot train/validation error vs λ:

  • High λ: Both errors high (underfitting)
  • Low λ: Training low, validation high (overfitting)
  • Sweet spot: Where validation error is minimized

The Bayesian Perspective

Regularization is equivalent to placing priors on weights:

Regularization | Prior
---------------|------------------------------
L2             | Gaussian (mean 0)
L1             | Laplace (double exponential)
None           | Uniform (improper)

Maximum A Posteriori (MAP) estimation with these priors gives regularized solutions.
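Sketching the standard derivation for the L2 case, with Gaussian noise and a Gaussian prior:

```latex
% Likelihood y = Xw + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2 I),
% prior w \sim \mathcal{N}(0, \sigma_w^2 I):
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right]
  = \arg\min_w \left[ \frac{1}{2\sigma^2}\,\|y - Xw\|^2
                    + \frac{1}{2\sigma_w^2}\,\|w\|^2 \right]
% Multiplying through by 2\sigma^2 gives ridge regression
% with \lambda = \sigma^2 / \sigma_w^2:
  = \arg\min_w \left[ \|y - Xw\|^2 + \lambda \|w\|^2 \right]
```

The same argument with a Laplace prior produces the L1 penalty.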

Common Pitfalls

  1. Regularizing the bias: Usually don't regularize the intercept/bias term

  2. Not scaling features: With L1/L2, larger-scale features are penalized more. Standardize first!

  3. Same λ for all features: Sometimes different features need different regularization

  4. Over-regularizing: Can cause underfitting
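Pitfall 2 is the most common in practice. A minimal sketch of the fix, z-score standardization, on synthetic data where one feature is on a much larger scale:

```python
import numpy as np

# Standardize features so the penalty treats them comparably.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
X[:, 1] *= 1000.0   # second feature on a much larger raw scale

# z-score standardization: zero mean, unit variance per column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.std(axis=0))  # both columns now have unit variance
```

Without this step, the penalty term implicitly favors or punishes features based on their units rather than their usefulness.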

Key Takeaways

  1. Regularization prevents overfitting by penalizing complexity
  2. L2 (Ridge): Shrinks weights, keeps all features
  3. L1 (Lasso): Produces sparse models, feature selection
  4. Elastic Net: Combines L1 and L2
  5. Always cross-validate to choose regularization strength
  6. Neural networks use dropout, batch norm, weight decay