Adam and Other Optimizers
Optimizers are algorithms that update neural network weights to reduce the loss. Adam (Adaptive Moment Estimation) is the most widely used, combining the momentum and adaptive-learning-rate ideas of its predecessors.
The Optimization Problem
Training minimizes a loss function:
θ* = argmin_θ L(θ)
The optimizer decides how to update weights θ given gradients ∇L.
Vanilla SGD
Update Rule
θ = θ - α × ∇L(θ)
Problems:
- Same learning rate for all parameters
- Slow to escape saddle points
- Oscillates in steep dimensions
- Slow in flat dimensions
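The update rule can be sanity-checked in a few lines of NumPy. The quadratic loss L(θ) = θ² (gradient 2θ) is a toy example chosen here for illustration:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: step against the gradient, same rate for every parameter."""
    return theta - lr * grad

# Toy problem: minimize L(θ) = θ², whose gradient is 2θ.
theta = np.array([1.0])
for _ in range(50):
    theta = sgd_step(theta, 2 * theta)
# θ shrinks by a factor of (1 - 2·lr) = 0.8 per step.
```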
SGD with Momentum
Intuition
Accumulate velocity like a ball rolling downhill:
v = βv + ∇L(θ) # Update velocity
θ = θ - α × v # Update parameters
Benefits:
- Accelerates in consistent gradient direction
- Dampens oscillations
- Helps escape local minima
Typical β = 0.9
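The same toy quadratic shows the velocity mechanism (a sketch, not a library API):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """Heavy-ball momentum: the velocity accumulates past gradients."""
    v = beta * v + grad      # update velocity
    return theta - lr * v, v  # step along the velocity

# Toy problem: L(θ) = θ², gradient 2θ.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
```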
Nesterov Momentum
Look ahead before computing gradient:
v = βv + ∇L(θ - αβv)
θ = θ - α × v
Slightly better convergence.
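The only change from plain momentum is where the gradient is evaluated, which the sketch below makes explicit (same toy quadratic as above):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov momentum: the gradient is taken at the look-ahead point."""
    v = beta * v + grad_fn(theta - lr * beta * v)
    return theta - lr * v, v

# Toy problem: L(θ) = θ², gradient 2θ.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x)
```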
AdaGrad
Intuition
Adapt learning rate per parameter based on history:
g² += (∇L)² # Accumulate squared gradients
θ = θ - α × ∇L / (√g² + ε) # Scale by inverse sqrt (ε avoids division by zero)
Benefits:
- Large updates for rare features
- Small updates for frequent features
- Good for sparse data
Problem: The accumulated sum only grows, so the effective learning rate decays toward zero and never recovers.
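Both the per-parameter scaling and the decay problem are visible in a short sketch (toy quadratic again; `step_sizes` records the shrinking effective steps):

```python
import numpy as np

def adagrad_step(theta, g2, grad, lr=0.5, eps=1e-8):
    """AdaGrad: divide by the root of the running *sum* of squared gradients."""
    g2 = g2 + grad ** 2
    return theta - lr * grad / (np.sqrt(g2) + eps), g2

# Toy problem: L(θ) = θ², gradient 2θ.
theta, g2 = np.array([1.0]), np.zeros(1)
step_sizes = []
for _ in range(100):
    new_theta, g2 = adagrad_step(theta, g2, 2 * theta)
    step_sizes.append(abs(new_theta - theta)[0])
    theta = new_theta
# Effective steps shrink monotonically: the accumulated g² only grows.
```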
RMSprop
Fix AdaGrad's Decay
Use exponential moving average instead of sum:
g² = βg² + (1-β)(∇L)² # EMA of squared gradients
θ = θ - α × ∇L / √(g²+ε) # Adaptive update
Typical β = 0.9, ε = 1e-8
Learning rate doesn't decay to zero.
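The one-line change from AdaGrad, as a sketch on the same toy quadratic:

```python
import numpy as np

def rmsprop_step(theta, g2, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: an EMA of squared gradients replaces AdaGrad's growing sum."""
    g2 = beta * g2 + (1 - beta) * grad ** 2
    return theta - lr * grad / (np.sqrt(g2) + eps), g2

# Toy problem: L(θ) = θ², gradient 2θ.
theta, g2 = np.array([1.0]), np.zeros(1)
for _ in range(500):
    theta, g2 = rmsprop_step(theta, g2, 2 * theta)
```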
Adam (Adaptive Moment Estimation)
Combines Momentum + RMSprop
m = β₁m + (1-β₁)∇L # First moment (momentum)
v = β₂v + (1-β₂)(∇L)² # Second moment (RMSprop)
# Bias correction (important early in training)
m̂ = m / (1-β₁ᵗ)
v̂ = v / (1-β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
Default Hyperparameters
α = 0.001 # Learning rate
β₁ = 0.9 # Momentum decay
β₂ = 0.999 # RMSprop decay
ε = 1e-8 # Numerical stability
Why Adam Works
- Momentum (m): Accelerates in consistent directions
- Adaptive LR (v): Per-parameter scaling
- Bias correction: Good early training behavior
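The full update, including bias correction, fits in one function (a minimal sketch on the toy quadratic, not the PyTorch implementation):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count for bias correction."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy problem: L(θ) = θ², gradient 2θ.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t, lr=0.01)
```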
Adam Variants
AdamW (Weight Decay)
Decouples weight decay from gradient:
θ = θ - α × (m̂ / (√v̂ + ε) + λθ)
Better generalization - recommended over vanilla Adam.
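The decoupling is easiest to see with a zero gradient: in AdamW the decay term still shrinks the weights, because it never passes through the adaptive scaling (a sketch, with illustrative hyperparameters):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied directly to θ, bypassing √v̂ scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

# With zero gradient, decoupled decay alone shrinks the weight geometrically:
# θ ← θ·(1 - lr·wd) each step.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, np.zeros(1), t, lr=0.1, wd=0.1)
```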
AdaFactor
Memory-efficient Adam for large models:
- Factorizes second moment matrix
- Used in T5, large transformers
LAMB
Layer-wise Adaptive Moments for Batch training:
- Scales each layer's update by the ratio of its weight norm to its update norm (a trust ratio)
- Enables very large batch sizes
Lion
Newer optimizer from Google, discovered by program search. It interpolates with β₁ for the sign-based update and tracks momentum with a separate β₂:
θ = θ - α × sign(β₁m + (1-β₁)∇L); m = β₂m + (1-β₂)∇L
- Simpler than Adam (sign update, one state tensor instead of two)
- Often matches or beats AdamW
- Less memory (no second moment)
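A sketch of the published update rule (two decay rates: β₁ interpolates for the sign, β₂ maintains the momentum EMA). The key property is that the sign makes every step the same size, regardless of gradient magnitude:

```python
import numpy as np

def lion_step(theta, m, grad, lr=0.01, b1=0.9, b2=0.99, wd=0.0):
    """Lion: step by the *sign* of interpolated momentum; one state tensor."""
    update = np.sign(b1 * m + (1 - b1) * grad)
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * grad   # momentum EMA, updated after the step
    return theta, m

# A huge and a tiny gradient produce identical step sizes (exactly lr).
theta_a, _ = lion_step(np.array([1.0]), np.zeros(1), np.array([100.0]))
theta_b, _ = lion_step(np.array([1.0]), np.zeros(1), np.array([0.001]))
```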
Comparison
| Optimizer | Memory | Speed | Best For |
|---|---|---|---|
| SGD+Momentum | Low | Slow | Final fine-tuning |
| Adam | Medium | Fast | Default choice |
| AdamW | Medium | Fast | Transformers |
| Lion | Low | Fast | Large models |
| LAMB | Medium | Fast | Large batch training |
Practical Guidelines
Default Starting Point
```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,           # Adjust based on task
    weight_decay=0.01  # Regularization
)
```
By Task Type
| Task | Optimizer | Learning Rate |
|---|---|---|
| Vision (from scratch) | SGD+Momentum | 0.1 |
| Vision (fine-tune) | AdamW | 1e-4 to 1e-5 |
| NLP/Transformers | AdamW | 1e-5 to 5e-5 |
| GAN | Adam | 1e-4 to 2e-4 |
| RL | Adam | 3e-4 |
When SGD Beats Adam
- Vision models trained from scratch
- When generalization matters most
- With proper learning rate schedule
- Final training runs (after hyperparameter search with Adam)
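The "proper learning rate schedule" point matters most here; SGD is typically paired with cosine annealing, often with warmup. A minimal schedule sketch (the function name and defaults are illustrative, not a library API):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0, warmup=0):
    """Cosine decay from base_lr to min_lr, with optional linear warmup."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr starts at base_lr and anneals smoothly toward min_lr over training.
lrs = [cosine_lr(s, 100, base_lr=0.1) for s in range(100)]
```

PyTorch ships an equivalent built-in, torch.optim.lr_scheduler.CosineAnnealingLR.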
Code Examples
```python
import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Adam
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999)
)

# AdamW (recommended)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01
)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
```
Key Takeaways
- SGD is simple but slow; momentum helps
- Adam combines momentum + adaptive learning rates
- AdamW is Adam with proper weight decay
- Use AdamW as default for most deep learning
- SGD can generalize better for vision models
- Always combine with learning rate scheduling