
Understand Adam and modern optimizers - the algorithms that make neural network training fast and effective.


Adam and Other Optimizers

Optimizers are algorithms that update neural network weights to minimize the loss. Adam (Adaptive Moment Estimation) is the most popular: it combines the best ideas from earlier optimizers, namely momentum and per-parameter adaptive learning rates.

The Optimization Problem

Training minimizes a loss function:

θ* = argmin_θ L(θ)

The optimizer decides how to update weights θ given gradients ∇L.

Vanilla SGD

Update Rule

θ = θ - α × ∇L(θ)

Problems:

  • Same learning rate for all parameters
  • Gets stuck in saddle points
  • Oscillates in steep dimensions
  • Slow in flat dimensions
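
The update rule above can be sketched in plain NumPy on a toy quadratic loss (names and constants are illustrative, not from any library):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: move opposite the gradient, same rate for every parameter."""
    return theta - lr * grad

# Toy loss L(theta) = 0.5 * theta^2, so grad = theta
theta = np.array([4.0, -2.0])
for _ in range(50):
    theta = sgd_step(theta, theta, lr=0.1)
# theta shrinks geometrically toward the minimum at 0
```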

SGD with Momentum

Intuition

Accumulate velocity like a ball rolling downhill:

v = βv + ∇L(θ)         # Update velocity
θ = θ - α × v          # Update parameters

Benefits:

  • Accelerates in consistent gradient direction
  • Dampens oscillations
  • Helps escape local minima

Typical β = 0.9
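
A minimal sketch of the two-line update above, again on a toy quadratic (illustrative names):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """Heavy-ball momentum: velocity is a running sum of decayed gradients."""
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

# Toy quadratic loss L(theta) = 0.5 * theta^2, so grad = theta
theta, v = np.array([4.0]), np.zeros(1)
for _ in range(100):
    theta, v = momentum_step(theta, v, theta)
# theta overshoots, oscillates, and settles near 0 faster than plain SGD
```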

Nesterov Momentum

Look ahead before computing gradient:

v = βv + ∇L(θ - αβv)
θ = θ - α × v

Slightly better convergence.
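
The look-ahead can be sketched by passing a gradient function instead of a precomputed gradient (a toy illustration, not a library API):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov: evaluate the gradient at the look-ahead point, not at theta."""
    lookahead = theta - lr * beta * v
    v = beta * v + grad_fn(lookahead)
    theta = theta - lr * v
    return theta, v

# Toy quadratic: grad of 0.5 * x^2 is x
theta, v = np.array([4.0]), np.zeros(1)
for _ in range(100):
    theta, v = nesterov_step(theta, v, lambda x: x)
```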

AdaGrad

Intuition

Adapt learning rate per parameter based on history:

g² += (∇L)²              # Accumulate squared gradients
θ = θ - α × ∇L / (√(g²) + ε)   # Scale by inverse sqrt

Benefits:

  • Large updates for rare features
  • Small updates for frequent features
  • Good for sparse data

Problem: Learning rate decays to zero (never recovers)
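
The decay problem is easy to see in a toy sketch: because g² only ever grows, the step size only ever shrinks (illustrative names):

```python
import numpy as np

def adagrad_step(theta, g2, grad, lr=0.5, eps=1e-8):
    """Accumulate squared gradients; the effective step only ever shrinks."""
    g2 = g2 + grad ** 2
    theta = theta - lr * grad / (np.sqrt(g2) + eps)
    return theta, g2

# Toy quadratic loss (grad = theta): record the step size at each iteration
theta, g2 = np.array([4.0]), np.zeros(1)
steps = []
for _ in range(100):
    new_theta, g2 = adagrad_step(theta, g2, theta)
    steps.append(float(abs(new_theta - theta)[0]))
    theta = new_theta
# steps[0] is the largest; later steps keep shrinking as g2 accumulates
```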

RMSprop

Fix AdaGrad's Decay

Use exponential moving average instead of sum:

g² = βg² + (1-β)(∇L)²     # EMA of squared gradients
θ = θ - α × ∇L / √(g²+ε)  # Adaptive update

Typical β = 0.9, ε = 1e-8

Learning rate doesn't decay to zero.
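
The fix is a one-line change from the AdaGrad sketch: replace the running sum with an exponential moving average (illustrative toy code):

```python
import numpy as np

def rmsprop_step(theta, g2, grad, lr=0.01, beta=0.9, eps=1e-8):
    """EMA of squared gradients: old history decays, so steps never die out."""
    g2 = beta * g2 + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(g2) + eps)
    return theta, g2

# Same toy quadratic (grad = theta); converges despite the tiny fixed lr
theta, g2 = np.array([4.0]), np.zeros(1)
for _ in range(600):
    theta, g2 = rmsprop_step(theta, g2, theta)
```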

Adam (Adaptive Moment Estimation)

Combines Momentum + RMSprop

m = β₁m + (1-β₁)∇L        # First moment (momentum)
v = β₂v + (1-β₂)(∇L)²     # Second moment (RMSprop)

# Bias correction (important early in training)
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)

θ = θ - α × m̂ / (√v̂ + ε)

Default Hyperparameters

α = 0.001    # Learning rate
β₁ = 0.9     # Momentum decay
β₂ = 0.999   # RMSprop decay
ε = 1e-8     # Numerical stability

Why Adam Works

  • Momentum (m): Accelerates in consistent directions
  • Adaptive LR (v): Per-parameter scaling
  • Bias correction: Good early training behavior
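
Putting the pieces together, one Adam step mirrors the equations above line for line (a toy sketch; the learning rate is raised above the 0.001 default so the toy problem converges quickly):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad              # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2         # second moment (RMSprop)
    m_hat = m / (1 - b1 ** t)                 # bias correction: m, v start
    v_hat = v / (1 - b2 ** t)                 # at zero and would be too small
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy quadratic loss (grad = theta); t starts at 1 for the bias correction
theta, m, v = np.array([4.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, theta, t, lr=0.01)
```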

Adam Variants

AdamW (Weight Decay)

Decouples weight decay from gradient:

θ = θ - α × (m̂ / (√v̂ + ε) + λθ)

Better generalization - recommended over vanilla Adam.
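
The decoupling is visible in a toy sketch: with a zero gradient, plain Adam leaves a weight untouched, while AdamW still shrinks it toward zero (illustrative code, not the PyTorch implementation):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay applied directly to theta, not mixed into grad."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# Zero gradient: the Adam part of the update is zero, but the decay
# term still multiplies theta by (1 - lr*wd): 1.0 -> 0.99999
theta, m, v = adamw_step(np.array([1.0]), np.zeros(1), np.zeros(1),
                         np.zeros(1), t=1)
```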

AdaFactor

Memory-efficient Adam for large models:

  • Factorizes second moment matrix
  • Used in T5, large transformers

LAMB

Layer-wise Adaptive Moments for Batch training:

  • Scales each layer's update by a trust ratio (weight norm / update norm)
  • Enables very large batch sizes

Lion

Newer optimizer from Google:

θ = θ - α × sign(βm + (1-β)∇L)

  • Simpler than Adam
  • Often better results
  • Less memory (stores one moment instead of two)
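
A simplified sketch of the sign-based update shown above (the full Lion algorithm uses two separate betas and decoupled weight decay; this toy version uses one beta for both roles):

```python
import numpy as np

def lion_step(theta, m, grad, lr=0.01, beta=0.9):
    """Lion (simplified): step by the SIGN of interpolated momentum only."""
    update = np.sign(beta * m + (1 - beta) * grad)
    theta = theta - lr * update
    m = beta * m + (1 - beta) * grad   # only this one buffer is stored
    return theta, m

# Toy quadratic loss (grad = theta): fixed-size steps march toward 0
theta, m = np.array([4.0]), np.zeros(1)
for _ in range(500):
    theta, m = lion_step(theta, m, theta)
```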

Comparison

Optimizer       Memory    Speed   Best For
SGD+Momentum    Low       Slow    Final fine-tuning
Adam            Medium    Fast    Default choice
AdamW           Medium    Fast    Transformers
Lion            Low       Fast    Large models
LAMB            Medium    Fast    Large batch training

Practical Guidelines

Default Starting Point

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,           # Adjust based on task
    weight_decay=0.01  # Regularization
)

By Task Type

Task                     Optimizer       Learning Rate
Vision (from scratch)    SGD+Momentum    0.1
Vision (fine-tune)       AdamW           1e-4 to 1e-5
NLP/Transformers         AdamW           1e-5 to 5e-5
GAN                      Adam            1e-4 to 2e-4
RL                       Adam            3e-4

When SGD Beats Adam

  • Vision models trained from scratch
  • When generalization matters most
  • With proper learning rate schedule
  • Final training runs (after hyperparameter search with Adam)

Code Examples

import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(
    model.parameters(), 
    lr=0.1, 
    momentum=0.9,
    weight_decay=1e-4
)

# Adam
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999)
)

# AdamW (recommended)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01
)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
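
The loop above is usually paired with a learning-rate schedule. A framework-free sketch of a common recipe, linear warmup followed by cosine decay (function name and constants are illustrative):

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup=100, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# peaks at base_lr right after warmup, decays to min_lr at the end
print(lr_at(99, 1000), lr_at(1000, 1000))
```

In PyTorch the same shape is available via the schedulers in torch.optim.lr_scheduler.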

Key Takeaways

  1. SGD is simple but slow; momentum helps
  2. Adam combines momentum + adaptive learning rates
  3. AdamW is Adam with proper weight decay
  4. Use AdamW as default for most deep learning
  5. SGD can generalize better for vision models
  6. Always combine with learning rate scheduling
