Intermediate · Optimization

Master learning rate scheduling - techniques to adjust the learning rate during training for faster convergence and better results.

Tags: optimization, training, hyperparameters, deep-learning

Learning Rate Scheduling

The learning rate is arguably the single most influential hyperparameter in deep learning. Learning rate scheduling adjusts it over the course of training to get both fast convergence and strong final performance.

[Figure: effect of learning rate on convergence]

Why Schedule the Learning Rate?

The Dilemma

  • Large LR: fast initial progress, but may overshoot the minimum
  • Small LR: precise convergence, but slow and prone to getting stuck

The Solution

Start high (fast progress) → decrease over time (precise convergence)

Loss
  |\                     
  | \     High LR: Fast but unstable
  |  \___              
  |      \___    Low LR: Slow but precise
  |          \____
  |_______________
       Epochs

Common Schedules

Step Decay

Reduce the LR by a constant factor every N epochs. (All schedulers shown here live in torch.optim.lr_scheduler.)

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# LR × 0.1 every 30 epochs
LR: 0.1 → 0.01 → 0.001 → ...
    |________|________|____
       30       60     90 epochs

Multi-Step Decay

Reduce at specific milestones:

scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

Exponential Decay

Multiply by gamma every epoch:

scheduler = ExponentialLR(optimizer, gamma=0.95)
# LR_new = LR × 0.95 each epoch

Cosine Annealing

Smooth cosine curve from initial to minimum LR:

scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
LR
   |‾‾\
   |    \
   |     \
   |      \__
   |_________
      Epochs
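The curve follows a closed-form expression. A sketch of the standard cosine-annealing formula (PyTorch's actual implementation is chainable and slightly more involved):

```python
import math

def cosine_annealing_lr(initial_lr, epoch, T_max=100, eta_min=1e-6):
    """eta_t = eta_min + (eta_0 - eta_min) * (1 + cos(pi * t / T_max)) / 2"""
    return eta_min + (initial_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2

print(cosine_annealing_lr(0.1, 0))    # starts at the initial LR
print(cosine_annealing_lr(0.1, 100))  # ends at eta_min
```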

Cosine Annealing with Warm Restarts

Cosine schedule that restarts periodically:

scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

Restarts can help escape local minima.
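With T_mult=2, each cycle is twice as long as the previous one. A quick sketch of when restarts would fire under these settings (hypothetical helper, not part of PyTorch):

```python
def restart_epochs(T_0=10, T_mult=2, horizon=100):
    """Epochs at which the cosine cycle restarts, for cycles of length T_0, T_0*T_mult, ..."""
    epochs, t, length = [], 0, T_0
    while t + length <= horizon:
        t += length
        epochs.append(t)
        length *= T_mult
    return epochs

print(restart_epochs())  # cycle lengths 10, 20, 40 -> restarts at epochs 10, 30, 70
```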

Linear Decay

Linear decrease from initial to final LR:

# Common in NLP/transformers
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=100)

Reduce On Plateau

Reduce when metric stops improving:

scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
# Reduce by 10x if no improvement for 10 epochs

# In training loop:
scheduler.step(val_loss)

Warmup

Gradually increase LR at the start of training:

LR
   |      ______
   |     /
   |    /
   |   /
   |__/____________
    warmup  training

Why Warmup?

  • Large gradients early in training with random weights
  • High LR + large gradients = instability
  • Warmup lets model find stable region first

Linear Warmup

def linear_warmup(epoch, warmup_epochs, initial_lr):
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs
    return initial_lr

Warmup + Cosine Decay (Common Pattern)

from transformers import get_cosine_schedule_with_warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=10000
)
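Under the hood, this schedule is just a multiplier applied to the base LR each step. An approximate pure-Python sketch (matching the library's default of half a cosine cycle; exact details may differ):

```python
import math

def warmup_cosine_factor(step, num_warmup_steps=1000, num_training_steps=10000):
    """LR multiplier: linear warmup from 0, then cosine decay to 0."""
    if step < num_warmup_steps:
        return step / num_warmup_steps
    progress = (step - num_warmup_steps) / (num_training_steps - num_warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))
```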

Cyclical Learning Rates

Triangular

Oscillate between bounds:

scheduler = CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-2, mode='triangular')

One Cycle

Single cycle: increase then decrease:

scheduler = OneCycleLR(
    optimizer,
    max_lr=1e-2,
    total_steps=1000,
    pct_start=0.3  # 30% warmup
)
LR
   |    /\
   |   /  \
   |  /    \
   | /      \____
   |/_____________
    30%      100%

Often reaches comparable or better results in fewer training steps.

Finding the Initial Learning Rate

LR Range Test

  1. Start with very small LR
  2. Increase exponentially each batch
  3. Plot loss vs LR
  4. Choose an LR where the loss decreases fastest (just before it diverges)

from torch_lr_finder import LRFinder

lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.plot()

Loss
   |\         /
   | \       /
   |  \_____/
   |     ↑
   |   Sweet spot
   |___________
         LR (log scale)

Best Practices

General Guidelines

Scenario                 Schedule
Training from scratch    Cosine or step decay
Fine-tuning              Linear decay with warmup
Quick training           OneCycleLR
Uncertain                ReduceLROnPlateau

Typical Values

# Vision (SGD)
initial_lr = 0.1

# Vision (Adam)
initial_lr = 1e-3  # typically 1e-3 down to 3e-4

# NLP/Transformers
initial_lr = 3e-5  # typically 1e-5 to 5e-5
# warmup: 5-10% of total training steps

Common Patterns

ResNet-style:

StepLR(optimizer, step_size=30, gamma=0.1)

Transformer-style:

get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_steps // 10,
    num_training_steps=num_steps
)

Fast.ai style:

OneCycleLR(optimizer, max_lr=find_lr(), total_steps=...)

Code Template

# Training loop with scheduler
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Step scheduler per batch (for OneCycleLR, CyclicLR, etc.)
        scheduler.step()

    # Or step per epoch (for StepLR, CosineAnnealingLR, etc.)
    # scheduler.step()

    # For ReduceLROnPlateau, pass the monitored metric
    # val_loss = validate()
    # scheduler.step(val_loss)

Key Takeaways

  1. Learning rate scheduling improves convergence and final performance
  2. Start high, decay over time is the general principle
  3. Warmup helps stabilize early training
  4. OneCycleLR often gives fast, good results
  5. ReduceLROnPlateau is safe when uncertain
  6. Use LR range test to find initial learning rate
