Momentum in Optimization
Momentum is a technique that accelerates gradient descent by adding a fraction of the previous update to the current one, so the optimizer builds up speed along consistent directions and damps oscillations.
The Problem with Vanilla Gradient Descent
Vanilla GD on an elongated loss surface:

    ○ Start
     ╲
      ○
     ╱        ← oscillates in the steep direction
    ○
     ╲
      ○───────○───────○
        slow progress in the flat direction
Gradient descent oscillates in steep directions and moves slowly in flat directions.
How Momentum Works
Physical Analogy
Imagine a ball rolling down a hill:
- Gains speed going downhill (accumulates velocity)
- Momentum carries it through small bumps
- Eventually settles at the bottom
Mathematical Formulation
Vanilla GD:
θ = θ - α × ∇L(θ)
With Momentum:
v = β × v + ∇L(θ) # Update velocity
θ = θ - α × v # Update parameters
Or equivalently:
v = β × v + α × ∇L(θ) # Velocity with learning rate
θ = θ - v # Update parameters
Where:
- v: velocity (accumulated gradients)
- β: momentum coefficient (typically 0.9)
- α: learning rate
- ∇L(θ): gradient at the current position
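As a sanity check, the two formulations trace the same parameter trajectory; the second formulation's velocity is simply α times the first's. A pure-Python sketch on a toy 1-D quadratic loss (illustrative values, not from the text):

```python
# Sanity check (toy example): both formulations on L(theta) = 0.5 * theta**2,
# whose gradient is simply theta.
alpha, beta = 0.1, 0.9

theta_a, v_a = 1.0, 0.0  # formulation A: v = beta*v + g;       theta -= alpha*v
theta_b, v_b = 1.0, 0.0  # formulation B: v = beta*v + alpha*g; theta -= v

for _ in range(50):
    v_a = beta * v_a + theta_a
    theta_a -= alpha * v_a
    v_b = beta * v_b + alpha * theta_b
    theta_b -= v_b

print(abs(theta_a - theta_b))  # ~0: the trajectories coincide
```

Because the trajectories match, the choice between the two is bookkeeping: it only changes whether the learning rate is folded into the velocity or applied at the parameter update.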
Visualization
Without momentum:        With momentum:

    ○                        ○
    │╲                       │
    ○─○                      ↓
      │╲                     ↓
      ○─○                    ↓
        │╲                   ○  Faster!
        ○───○

    Oscillates               Smoother path
Implementation
From Scratch
import torch

class SGDMomentum:
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.velocities = [torch.zeros_like(p) for p in self.params]

    def step(self):
        for param, velocity in zip(self.params, self.velocities):
            if param.grad is None:
                continue
            # Update velocity: v = beta * v + grad
            velocity.mul_(self.momentum).add_(param.grad)
            # Update parameters: theta = theta - lr * v
            param.data.add_(velocity, alpha=-self.lr)

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()
PyTorch
import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9
)

# Training loop
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
Momentum Variations
Classical Momentum
v = β × v + ∇L(θ)
θ = θ - α × v
Nesterov Momentum
Look ahead before computing gradient:
# Compute gradient at "look-ahead" position
θ_lookahead = θ - α × β × v
v = β × v + ∇L(θ_lookahead)
θ = θ - α × v
Classical: Compute gradient → Update
Nesterov: Look ahead → Compute gradient → Update
(More accurate gradient direction)
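A one-step, pure-Python comparison of the two rules (toy quadratic with grad(θ) = θ; the starting velocity and coefficients are made-up illustrative values):

```python
# One-step comparison (toy values): classical vs. Nesterov momentum on
# L(theta) = 0.5 * theta**2, so grad(theta) = theta.
alpha, beta = 0.1, 0.9
theta, v = 1.0, 0.5  # pretend some velocity has already accumulated

# Classical: gradient at the current position
v_classical = beta * v + theta
theta_classical = theta - alpha * v_classical

# Nesterov: gradient at the look-ahead position theta - alpha*beta*v
lookahead = theta - alpha * beta * v
v_nesterov = beta * v + lookahead
theta_nesterov = theta - alpha * v_nesterov

print(theta_classical, theta_nesterov)
# Nesterov steps slightly less far here, because the look-ahead point has a
# smaller gradient -- it anticipates the slope flattening out.
```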
PyTorch Nesterov
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Enable Nesterov momentum
)
Effect of Momentum Coefficient
β = 0.0: No momentum (vanilla GD)
β = 0.5: Moderate smoothing
β = 0.9: Strong momentum (most common)
β = 0.99: Very strong (can overshoot)
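A small sweep on a toy 1-D quadratic (learning rate and step budget are arbitrary illustrative choices) shows these regimes:

```python
# Sweep over beta (toy 1-D quadratic): distance from the minimum after a
# fixed number of steps. lr and step count chosen for illustration only.
def final_dist(beta, steps=60, lr=0.05):
    theta, v = 1.0, 0.0
    for _ in range(steps):
        v = beta * v + theta  # gradient of 0.5 * theta**2 is theta
        theta -= lr * v
    return abs(theta)

for beta in (0.0, 0.5, 0.9, 0.99):
    print(f"beta={beta}: {final_dist(beta):.6f}")
```

Which β is fastest depends on the learning rate and curvature; on this particular toy problem the moderate settings converge quickest, while β = 0.99 overshoots and rings for a long time.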
Effective Learning Rate
With momentum, the effective step size is larger:
If gradients are consistent:
Effective LR ≈ α / (1 - β)
With β=0.9, α=0.01:
Effective LR ≈ 0.01 / 0.1 = 0.1 (10x larger!)
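This is just the geometric series: with a constant gradient g, the velocity recursion v = β × v + g converges to g / (1 - β). A quick check (toy values):

```python
# Geometric-series check: feed a constant gradient into the velocity
# recursion and watch it converge to g / (1 - beta).
alpha, beta, g = 0.01, 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v + g

print(v, alpha * v)  # v -> g/(1-beta) = 10, so the step -> 0.1 (10x alpha)
```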
When Momentum Helps
1. Ravines/Elongated Surfaces
Loss surface like a valley:
- Steep walls (large gradients)
- Gradual slope along valley (small gradients)
Momentum:
- Oscillations in steep direction cancel out
- Consistent movement along valley accumulates
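A minimal pure-Python sketch of this ravine behavior, on a quadratic with curvature 10 in x (steep) and 1 in y (flat); the constants are illustrative, not from the text:

```python
# Vanilla GD vs. momentum on f(x, y) = 0.5 * (10 * x**2 + y**2),
# steep in x and flat in y. lr, beta, and step count are illustrative.
def run(steps, lr, beta):
    x, y = 1.0, 1.0           # start away from the minimum at (0, 0)
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = 10.0 * x, y  # gradients of f
        vx = beta * vx + gx   # beta = 0 recovers vanilla GD
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return (x * x + y * y) ** 0.5  # distance to the minimum

plain = run(100, lr=0.01, beta=0.0)
mom = run(100, lr=0.01, beta=0.9)
print(f"vanilla: {plain:.4f}  momentum: {mom:.4f}")
# Vanilla GD is still far away along the flat y direction; the momentum
# run's accumulated velocity has carried it much closer to the minimum.
```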
2. Noisy Gradients
Gradient estimates from mini-batches are noisy.
Momentum averages out noise:
v_t = β × v_{t-1} + g_t
    = β × (β × v_{t-2} + g_{t-1}) + g_t
    = β² × v_{t-2} + β × g_{t-1} + g_t
    = exponentially weighted sum of past gradients
      (an unnormalized exponential moving average)
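A sketch of the averaging effect, using the normalized EMA form v = β × v + (1 - β) × g (with the same inputs, the accumulation above differs only by the constant factor 1/(1 - β)). The "noise" is a deterministic ±0.5 wobble chosen for reproducibility:

```python
# Momentum as noise averaging: noisy gradients wobble around a true value
# of 1.0; the EMA of those gradients stays much closer to 1.0.
beta = 0.9
v = 0.0
raw, smooth = [], []
for t in range(100):
    g = 1.0 + (0.5 if t % 2 == 0 else -0.5)  # noisy gradient around 1.0
    v = beta * v + (1 - beta) * g            # normalized EMA of gradients
    raw.append(g)
    smooth.append(v)

# After burn-in, compare the spread of raw gradients vs. the EMA.
spread_raw = max(raw[50:]) - min(raw[50:])
spread_ema = max(smooth[50:]) - min(smooth[50:])
print(spread_raw, spread_ema)  # the EMA's spread is far smaller
```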
3. Saddle Points
At saddle points, gradients are small.
Momentum carries optimizer past them:
○→→→→→→○ (momentum helps escape)
Saddle
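A toy illustration of this escape effect: a steep slope feeding into a long near-flat plateau, standing in for the small-gradient neighborhood of a saddle. The piecewise gradient is invented for the sketch:

```python
# Plateau toy: gradient is steep for x < 0, then nearly zero.
def grad(x):
    return -1.0 if x < 0.0 else -0.001

def run(beta, steps=300, lr=0.01):
    x, v = -1.0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)
        x -= lr * v
    return x

print(run(0.0), run(0.9))
# Vanilla GD barely enters the plateau; the velocity built up on the
# slope carries the momentum run far across it.
```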
Momentum in Adam
Adam uses momentum for both gradients and squared gradients:
# First moment (like momentum)
m = β1 × m + (1 - β1) × g
# Second moment (for the adaptive learning rate)
v = β2 × v + (1 - β2) × g²
# Bias-correct the zero-initialized moments
m̂ = m / (1 - β1^t)
v̂ = v / (1 - β2^t)
# Update
θ = θ - α × m̂ / (√v̂ + ε)
Typical: β1=0.9 (momentum), β2=0.999 (RMSprop-like)
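A single-parameter Adam step in pure Python with the default hyperparameters, including the bias correction for the zero-initialized moments:

```python
# One-parameter Adam step; t counts from 1 for the bias correction.
import math

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g      # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g  # second moment (scale)
    m_hat = m / (1 - b1 ** t)      # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    theta, m, v = adam_step(theta, 1.0, m, v, t)  # constant gradient of 1.0
print(theta)  # ~0.99: with a constant unit gradient each step is ~lr
```

Note how the bias correction keeps the early steps at roughly lr-sized even though m and v start at zero.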
Practical Tips
1. Standard Values
# These work well in most cases
momentum = 0.9
lr = 0.01 # Might need to be lower than without momentum
2. Learning Rate Interaction
# Without momentum: lr = 0.1 works
# With momentum: lr = 0.01-0.03 often better
# (momentum amplifies effective step size)
3. Warmup with Momentum
# Start with lower momentum, increase over epochs
for epoch in range(num_epochs):
    momentum = min(0.9, 0.5 + epoch * 0.1)
    for param_group in optimizer.param_groups:
        param_group['momentum'] = momentum
Comparison: SGD vs Momentum vs Adam
| Optimizer | Convergence | Memory | Hyperparameters |
|---|---|---|---|
| SGD | Slow | O(1) | lr only |
| SGD + Momentum | Fast | O(n) | lr, β |
| Adam | Fast | O(2n) | lr, β1, β2 |
For CNNs: SGD + Momentum often best
For Transformers: Adam/AdamW often best
For quick experiments: Adam is reliable default
Debugging Momentum
Too High Momentum
Symptoms: Oscillations, overshooting minima
Fix: Lower β (try 0.8) or lower learning rate
Too Low Momentum
Symptoms: Slow convergence, stuck in ravines
Fix: Increase β (try 0.95)
Monitoring
# Log gradient and velocity norms (SGD stores each parameter's velocity
# in optimizer.state under the key 'momentum_buffer')
grad_norm = sum(p.grad.norm() for p in model.parameters() if p.grad is not None)
vel_norm = sum(
    s['momentum_buffer'].norm()
    for s in optimizer.state.values()
    if 'momentum_buffer' in s
)
print(f"Grad norm: {grad_norm:.4f}, Velocity norm: {vel_norm:.4f}")
Key Takeaways
- Momentum accumulates past gradients to accelerate consistent directions
- Typical momentum coefficient: β = 0.9
- Dampens oscillations in steep directions
- Nesterov momentum computes gradient at look-ahead position
- May need to reduce learning rate when adding momentum
- Adam includes momentum as its first moment estimate