Adam and Other Optimizers
Optimizers are algorithms that update neural network weights to reduce the loss. Adam (Adaptive Moment Estimation) is the most widely used, combining the momentum and adaptive-learning-rate ideas of its predecessors.
The Optimization Problem
Training minimizes a loss function:
θ* = argmin_θ L(θ)
The optimizer decides how to update weights θ given gradients ∇L.
Vanilla SGD
Update Rule
θ = θ - α × ∇L(θ)
Problems:
- Same learning rate for all parameters
- Slow to escape saddle points
- Oscillates in steep dimensions
- Slow in flat dimensions
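The update rule can be sanity-checked in a few lines of NumPy. The quadratic loss L(θ) = θ² (gradient 2θ) is a toy example chosen here for illustration:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: step against the gradient, same rate for every parameter."""
    return theta - lr * grad

# Toy problem: minimize L(θ) = θ², whose gradient is 2θ.
theta = np.array([1.0])
for _ in range(50):
    theta = sgd_step(theta, 2 * theta)
# θ shrinks by a factor of (1 - 2·lr) = 0.8 per step.
```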
SGD with Momentum
Intuition
Accumulate velocity like a ball rolling downhill:
v = βv + ∇L(θ) # Update velocity
θ = θ - α × v # Update parameters
Benefits:
- Accelerates in consistent gradient direction
- Dampens oscillations
- Helps escape local minima
Typical β = 0.9
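The same toy quadratic shows the velocity mechanism (a sketch, not a library API):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """Heavy-ball momentum: the velocity accumulates past gradients."""
    v = beta * v + grad      # update velocity
    return theta - lr * v, v  # step along the velocity

# Toy problem: L(θ) = θ², gradient 2θ.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
```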
Nesterov Momentum
Look ahead before computing gradient:
v = βv + ∇L(θ - αβv)
θ = θ - α × v
Slightly better convergence.
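The only change from plain momentum is where the gradient is evaluated, which the sketch below makes explicit (same toy quadratic as above):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov momentum: the gradient is taken at the look-ahead point."""
    v = beta * v + grad_fn(theta - lr * beta * v)
    return theta - lr * v, v

# Toy problem: L(θ) = θ², gradient 2θ.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x)
```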
AdaGrad
Intuition
Adapt learning rate per parameter based on history:
g² += (∇L)² # Accumulate squared gradients
θ = θ - α × ∇L / (√g² + ε) # Scale by inverse sqrt (ε avoids division by zero)
Benefits:
- Large updates for rare features
- Small updates for frequent features
- Good for sparse data
Problem: The accumulated sum only grows, so the effective learning rate decays toward zero and never recovers.
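Both the per-parameter scaling and the decay problem are visible in a short sketch (toy quadratic again; `step_sizes` records the shrinking effective steps):

```python
import numpy as np

def adagrad_step(theta, g2, grad, lr=0.5, eps=1e-8):
    """AdaGrad: divide by the root of the running *sum* of squared gradients."""
    g2 = g2 + grad ** 2
    return theta - lr * grad / (np.sqrt(g2) + eps), g2

# Toy problem: L(θ) = θ², gradient 2θ.
theta, g2 = np.array([1.0]), np.zeros(1)
step_sizes = []
for _ in range(100):
    new_theta, g2 = adagrad_step(theta, g2, 2 * theta)
    step_sizes.append(abs(new_theta - theta)[0])
    theta = new_theta
# Effective steps shrink monotonically: the accumulated g² only grows.
```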
RMSprop
Fix AdaGrad's Decay
Use exponential moving average instead of sum:
g² = βg² + (1-β)(∇L)² # EMA of squared gradients
θ = θ - α × ∇L / √(g²+ε) # Adaptive update
Typical β = 0.9, ε = 1e-8
Learning rate doesn't decay to zero.
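The one-line change from AdaGrad, as a sketch on the same toy quadratic:

```python
import numpy as np

def rmsprop_step(theta, g2, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: an EMA of squared gradients replaces AdaGrad's growing sum."""
    g2 = beta * g2 + (1 - beta) * grad ** 2
    return theta - lr * grad / (np.sqrt(g2) + eps), g2

# Toy problem: L(θ) = θ², gradient 2θ.
theta, g2 = np.array([1.0]), np.zeros(1)
for _ in range(500):
    theta, g2 = rmsprop_step(theta, g2, 2 * theta)
```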
Adam (Adaptive Moment Estimation)
Combines Momentum + RMSprop
m = β₁m + (1-β₁)∇L # First moment (momentum)
v = β₂v + (1-β₂)(∇L)² # Second moment (RMSprop)
# Bias correction (important early in training)
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
Default Hyperparameters
α = 0.001 # Learning rate
β₁ = 0.9 # Momentum decay
β₂ = 0.999 # RMSprop decay
ε = 1e-8 # Numerical stability
Why Adam Works
- Momentum (m): Accelerates in consistent directions
- Adaptive LR (v): Per-parameter scaling
- Bias correction: Good early training behavior
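The full update, including bias correction, fits in one function (a minimal sketch on the toy quadratic, not the PyTorch implementation):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count for bias correction."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy problem: L(θ) = θ², gradient 2θ.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t, lr=0.01)
```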
Adam Variants
AdamW (Weight Decay)
Decouples weight decay from gradient:
θ = θ - α × (m̂ / (√v̂ + ε) + λθ)
Better generalization - recommended over vanilla Adam.
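The decoupling is easiest to see with a zero gradient: in AdamW the decay term still shrinks the weights, because it never passes through the adaptive scaling (a sketch, with illustrative hyperparameters):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied directly to θ, bypassing √v̂ scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

# With zero gradient, decoupled decay alone shrinks the weight geometrically:
# θ ← θ·(1 - lr·wd) each step.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, np.zeros(1), t, lr=0.1, wd=0.1)
```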
AdaFactor
Memory-efficient Adam for large models:
- Factorizes second moment matrix
- Used in T5, large transformers
LAMB
Layer-wise Adaptive Moments for Batch training:
- Scales each layer's update by the ratio of its weight norm to its update norm (a trust ratio)
- Enables very large batch sizes
Lion
Newer optimizer from Google, discovered by program search. It interpolates with β₁ for the sign-based update and tracks momentum with a separate β₂:
θ = θ - α × sign(β₁m + (1-β₁)∇L); m = β₂m + (1-β₂)∇L
- Simpler than Adam (sign update, one state tensor instead of two)
- Often matches or beats AdamW
- Less memory (no second moment)
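A sketch of the published update rule (two decay rates: β₁ interpolates for the sign, β₂ maintains the momentum EMA). The key property is that the sign makes every step the same size, regardless of gradient magnitude:

```python
import numpy as np

def lion_step(theta, m, grad, lr=0.01, b1=0.9, b2=0.99, wd=0.0):
    """Lion: step by the *sign* of interpolated momentum; one state tensor."""
    update = np.sign(b1 * m + (1 - b1) * grad)
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * grad   # momentum EMA, updated after the step
    return theta, m

# A huge and a tiny gradient produce identical step sizes (exactly lr).
theta_a, _ = lion_step(np.array([1.0]), np.zeros(1), np.array([100.0]))
theta_b, _ = lion_step(np.array([1.0]), np.zeros(1), np.array([0.001]))
```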
Comparison
| Optimizer | Memory | Speed | Best For |
|---|---|---|---|
| SGD+Momentum | Low | Slow | Final fine-tuning |
| Adam | Medium | Fast | Default choice |
| AdamW | Medium | Fast | Transformers |
| Lion | Low | Fast | Large models |
| LAMB | Medium | Fast | Large batch training |
Practical Guidelines
Default Starting Point
```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,           # Adjust based on task
    weight_decay=0.01  # Regularization
)
```
By Task Type
| Task | Optimizer | Learning Rate |
|---|---|---|
| Vision (from scratch) | SGD+Momentum | 0.1 |
| Vision (fine-tune) | AdamW | 1e-4 to 1e-5 |
| NLP/Transformers | AdamW | 1e-5 to 5e-5 |
| GAN | Adam | 1e-4 to 2e-4 |
| RL | Adam | 3e-4 |
When SGD Beats Adam
- Vision models trained from scratch
- When generalization matters most
- With proper learning rate schedule
- Final training runs (after hyperparameter search with Adam)
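The "proper learning rate schedule" point matters most here; SGD is typically paired with cosine annealing, often with warmup. A minimal schedule sketch (the function name and defaults are illustrative, not a library API):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0, warmup=0):
    """Cosine decay from base_lr to min_lr, with optional linear warmup."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr starts at base_lr and anneals smoothly toward min_lr over training.
lrs = [cosine_lr(s, 100, base_lr=0.1) for s in range(100)]
```

PyTorch ships an equivalent built-in, torch.optim.lr_scheduler.CosineAnnealingLR.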
Code Examples
```python
import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Adam
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999)
)

# AdamW (recommended)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01
)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
```
Key Takeaways
- SGD is simple but slow; momentum helps
- Adam combines momentum + adaptive learning rates
- AdamW is Adam with proper weight decay
- Use AdamW as default for most deep learning
- SGD can generalize better for vision models
- Always combine with learning rate scheduling