Momentum in Optimization
Momentum is a technique that accelerates gradient descent by adding a fraction of the previous update to the current one, so the optimizer builds up speed along consistent directions and damps oscillations.
The Problem with Vanilla Gradient Descent
Vanilla GD on an elongated loss surface:

    ○ Start
     ╲
      ○
     ╱        ← oscillates in the steep direction
    ○
     ╲
      ○───────○───────○
        slow progress in the flat direction
Gradient descent oscillates in steep directions and moves slowly in flat directions.
How Momentum Works
Physical Analogy
Imagine a ball rolling down a hill:
- Gains speed going downhill (accumulates velocity)
- Momentum carries it through small bumps
- Eventually settles at the bottom
Mathematical Formulation
Vanilla GD:
θ = θ - α × ∇L(θ)
With Momentum:
v = β × v + ∇L(θ) # Update velocity
θ = θ - α × v # Update parameters
Or equivalently:
v = β × v + α × ∇L(θ) # Velocity with learning rate
θ = θ - v # Update parameters
Where:
- v: velocity (accumulated gradients)
- β: momentum coefficient (typically 0.9)
- α: learning rate
- ∇L(θ): gradient at the current position
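As a sanity check, the two formulations trace the same parameter trajectory; the second formulation's velocity is simply α times the first's. A pure-Python sketch on a toy 1-D quadratic loss (illustrative values, not from the text):

```python
# Sanity check (toy example): both formulations on L(theta) = 0.5 * theta**2,
# whose gradient is simply theta.
alpha, beta = 0.1, 0.9

theta_a, v_a = 1.0, 0.0  # formulation A: v = beta*v + g;       theta -= alpha*v
theta_b, v_b = 1.0, 0.0  # formulation B: v = beta*v + alpha*g; theta -= v

for _ in range(50):
    v_a = beta * v_a + theta_a
    theta_a -= alpha * v_a
    v_b = beta * v_b + alpha * theta_b
    theta_b -= v_b

print(abs(theta_a - theta_b))  # ~0: the trajectories coincide
```

Because the trajectories match, the choice between the two is bookkeeping: it only changes whether the learning rate is folded into the velocity or applied at the parameter update.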
Visualization
Without momentum:        With momentum:

    ○                        ○
    │╲                       │
    ○─○                      ↓
      │╲                     ↓
      ○─○                    ↓
        │╲                   ○  Faster!
        ○───○

    Oscillates               Smoother path
Implementation
From Scratch
import torch

class SGDMomentum:
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.velocities = [torch.zeros_like(p) for p in self.params]

    def step(self):
        for param, velocity in zip(self.params, self.velocities):
            if param.grad is None:
                continue
            # Update velocity: v = beta * v + grad
            velocity.mul_(self.momentum).add_(param.grad)
            # Update parameters: theta = theta - lr * v
            param.data.add_(velocity, alpha=-self.lr)

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()
PyTorch
import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9
)

# Training loop
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
Momentum Variations
Classical Momentum
v = β × v + ∇L(θ)
θ = θ - α × v
Nesterov Momentum
Look ahead before computing gradient:
# Compute gradient at "look-ahead" position
θ_lookahead = θ - α × β × v
v = β × v + ∇L(θ_lookahead)
θ = θ - α × v
Classical: Compute gradient → Update
Nesterov: Look ahead → Compute gradient → Update
(More accurate gradient direction)
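A one-step, pure-Python comparison of the two rules (toy quadratic with grad(θ) = θ; the starting velocity and coefficients are made-up illustrative values):

```python
# One-step comparison (toy values): classical vs. Nesterov momentum on
# L(theta) = 0.5 * theta**2, so grad(theta) = theta.
alpha, beta = 0.1, 0.9
theta, v = 1.0, 0.5  # pretend some velocity has already accumulated

# Classical: gradient at the current position
v_classical = beta * v + theta
theta_classical = theta - alpha * v_classical

# Nesterov: gradient at the look-ahead position theta - alpha*beta*v
lookahead = theta - alpha * beta * v
v_nesterov = beta * v + lookahead
theta_nesterov = theta - alpha * v_nesterov

print(theta_classical, theta_nesterov)
# Nesterov steps slightly less far here, because the look-ahead point has a
# smaller gradient -- it anticipates the slope flattening out.
```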
PyTorch Nesterov
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Enable Nesterov momentum
)
Effect of Momentum Coefficient
β = 0.0: No momentum (vanilla GD)
β = 0.5: Moderate smoothing
β = 0.9: Strong momentum (most common)
β = 0.99: Very strong (can overshoot)
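A small sweep on a toy 1-D quadratic (learning rate and step budget are arbitrary illustrative choices) shows these regimes:

```python
# Sweep over beta (toy 1-D quadratic): distance from the minimum after a
# fixed number of steps. lr and step count chosen for illustration only.
def final_dist(beta, steps=60, lr=0.05):
    theta, v = 1.0, 0.0
    for _ in range(steps):
        v = beta * v + theta  # gradient of 0.5 * theta**2 is theta
        theta -= lr * v
    return abs(theta)

for beta in (0.0, 0.5, 0.9, 0.99):
    print(f"beta={beta}: {final_dist(beta):.6f}")
```

Which β is fastest depends on the learning rate and curvature; on this particular toy problem the moderate settings converge quickest, while β = 0.99 overshoots and rings for a long time.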
Effective Learning Rate
With momentum, the effective step size is larger:
If gradients are consistent:
Effective LR ≈ α / (1 - β)
With β=0.9, α=0.01:
Effective LR ≈ 0.01 / 0.1 = 0.1 (10x larger!)
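This is just the geometric series: with a constant gradient g, the velocity recursion v = β × v + g converges to g / (1 - β). A quick check (toy values):

```python
# Geometric-series check: feed a constant gradient into the velocity
# recursion and watch it converge to g / (1 - beta).
alpha, beta, g = 0.01, 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v + g

print(v, alpha * v)  # v -> g/(1-beta) = 10, so the step -> 0.1 (10x alpha)
```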
When Momentum Helps
1. Ravines/Elongated Surfaces
Loss surface like a valley:
- Steep walls (large gradients)
- Gradual slope along valley (small gradients)
Momentum:
- Oscillations in steep direction cancel out
- Consistent movement along valley accumulates
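A minimal pure-Python sketch of this ravine behavior, on a quadratic with curvature 10 in x (steep) and 1 in y (flat); the constants are illustrative, not from the text:

```python
# Vanilla GD vs. momentum on f(x, y) = 0.5 * (10 * x**2 + y**2),
# steep in x and flat in y. lr, beta, and step count are illustrative.
def run(steps, lr, beta):
    x, y = 1.0, 1.0           # start away from the minimum at (0, 0)
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = 10.0 * x, y  # gradients of f
        vx = beta * vx + gx   # beta = 0 recovers vanilla GD
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return (x * x + y * y) ** 0.5  # distance to the minimum

plain = run(100, lr=0.01, beta=0.0)
mom = run(100, lr=0.01, beta=0.9)
print(f"vanilla: {plain:.4f}  momentum: {mom:.4f}")
# Vanilla GD is still far away along the flat y direction; the momentum
# run's accumulated velocity has carried it much closer to the minimum.
```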
2. Noisy Gradients
Gradient estimates from mini-batches are noisy.
Momentum averages out noise:
v_t = β × v_{t-1} + g_t
    = β × (β × v_{t-2} + g_{t-1}) + g_t
    = β² × v_{t-2} + β × g_{t-1} + g_t
    = exponentially weighted sum of past gradients
      (an unnormalized exponential moving average)
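A sketch of the averaging effect, using the normalized EMA form v = β × v + (1 - β) × g (with the same inputs, the accumulation above differs only by the constant factor 1/(1 - β)). The "noise" is a deterministic ±0.5 wobble chosen for reproducibility:

```python
# Momentum as noise averaging: noisy gradients wobble around a true value
# of 1.0; the EMA of those gradients stays much closer to 1.0.
beta = 0.9
v = 0.0
raw, smooth = [], []
for t in range(100):
    g = 1.0 + (0.5 if t % 2 == 0 else -0.5)  # noisy gradient around 1.0
    v = beta * v + (1 - beta) * g            # normalized EMA of gradients
    raw.append(g)
    smooth.append(v)

# After burn-in, compare the spread of raw gradients vs. the EMA.
spread_raw = max(raw[50:]) - min(raw[50:])
spread_ema = max(smooth[50:]) - min(smooth[50:])
print(spread_raw, spread_ema)  # the EMA's spread is far smaller
```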
3. Saddle Points
At saddle points, gradients are small.
Momentum carries optimizer past them:
○→→→→→→○ (momentum helps escape)
Saddle
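A toy illustration of this escape effect: a steep slope feeding into a long near-flat plateau, standing in for the small-gradient neighborhood of a saddle. The piecewise gradient is invented for the sketch:

```python
# Plateau toy: gradient is steep for x < 0, then nearly zero.
def grad(x):
    return -1.0 if x < 0.0 else -0.001

def run(beta, steps=300, lr=0.01):
    x, v = -1.0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)
        x -= lr * v
    return x

print(run(0.0), run(0.9))
# Vanilla GD barely enters the plateau; the velocity built up on the
# slope carries the momentum run far across it.
```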
Momentum in Adam
Adam uses momentum for both gradients and squared gradients:
# First moment (like momentum)
m = β1 × m + (1 - β1) × g
# Second moment (for the adaptive learning rate)
v = β2 × v + (1 - β2) × g²
# Bias-correct the zero-initialized moments
m̂ = m / (1 - β1^t)
v̂ = v / (1 - β2^t)
# Update
θ = θ - α × m̂ / (√v̂ + ε)
Typical: β1=0.9 (momentum), β2=0.999 (RMSprop-like)
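A single-parameter Adam step in pure Python with the default hyperparameters, including the bias correction for the zero-initialized moments:

```python
# One-parameter Adam step; t counts from 1 for the bias correction.
import math

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g      # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g  # second moment (scale)
    m_hat = m / (1 - b1 ** t)      # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    theta, m, v = adam_step(theta, 1.0, m, v, t)  # constant gradient of 1.0
print(theta)  # ~0.99: with a constant unit gradient each step is ~lr
```

Note how the bias correction keeps the early steps at roughly lr-sized even though m and v start at zero.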
Practical Tips
1. Standard Values
# These work well in most cases
momentum = 0.9
lr = 0.01 # Might need to be lower than without momentum
2. Learning Rate Interaction
# Without momentum: lr = 0.1 works
# With momentum: lr = 0.01-0.03 often better
# (momentum amplifies effective step size)
3. Warmup with Momentum
# Start with lower momentum, increase over epochs
for epoch in range(num_epochs):
    momentum = min(0.9, 0.5 + epoch * 0.1)
    for param_group in optimizer.param_groups:
        param_group['momentum'] = momentum
Comparison: SGD vs Momentum vs Adam
| Optimizer | Convergence | Memory | Hyperparameters |
|---|---|---|---|
| SGD | Slow | O(1) | lr only |
| SGD + Momentum | Fast | O(n) | lr, β |
| Adam | Fast | O(2n) | lr, β1, β2 |
For CNNs: SGD + Momentum often best
For Transformers: Adam/AdamW often best
For quick experiments: Adam is reliable default
Debugging Momentum
Too High Momentum
Symptoms: Oscillations, overshooting minima
Fix: Lower β (try 0.8) or lower learning rate
Too Low Momentum
Symptoms: Slow convergence, stuck in ravines
Fix: Increase β (try 0.95)
Monitoring
# Log gradient and velocity norms (SGD stores each parameter's velocity
# in optimizer.state under the key 'momentum_buffer')
grad_norm = sum(p.grad.norm() for p in model.parameters() if p.grad is not None)
vel_norm = sum(
    s['momentum_buffer'].norm()
    for s in optimizer.state.values()
    if 'momentum_buffer' in s
)
print(f"Grad norm: {grad_norm:.4f}, Velocity norm: {vel_norm:.4f}")
Key Takeaways
- Momentum accumulates past gradients to accelerate consistent directions
- Typical momentum coefficient: β = 0.9
- Dampens oscillations in steep directions
- Nesterov momentum computes gradient at look-ahead position
- May need to reduce learning rate when adding momentum
- Adam includes momentum as its first moment estimate