
Learn how gradient accumulation enables training with larger effective batch sizes when GPU memory is limited.


Gradient Accumulation

Gradient accumulation is a technique that simulates larger batch sizes by accumulating gradients over multiple mini-batches before performing a weight update, enabling training of large models on limited hardware.

The Problem

Desired batch size: 64
GPU memory: Can only fit batch size 8

Solution: Accumulate gradients over 8 mini-batches of size 8
          Effective batch size: 8 × 8 = 64

How It Works

Without Accumulation (batch_size=8):
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Batch 1 │ → │ Update  │ → │ Batch 2 │ → Update → ...
└─────────┘   └─────────┘   └─────────┘

With Accumulation (effective batch_size=32, accumulate=4):
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Batch 1 │ → │ Batch 2 │ → │ Batch 3 │ → │ Batch 4 │ → │ Update  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
  Accumulate   Accumulate    Accumulate   Accumulate      Step

Mathematical Equivalence

Large batch gradient:
  g = (1/N) × Σᵢ ∇L(xᵢ)

Accumulated gradient (k mini-batches of size n):
  g = (1/k) × Σⱼ (1/n) × Σᵢ∈batch_j ∇L(xᵢ)
    = (1/N) × Σᵢ ∇L(xᵢ)  (same!)
  
where N = k × n
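The identity can be checked numerically with a toy scalar model (the model and data below are made up purely for illustration):

```python
# Toy model y = w * x with squared error L = (w*x - t)^2, so dL/dw = 2*(w*x - t)*x.
def grad(w, x, t):
    return 2.0 * (w * x - t) * x

w = 0.5
data = [(1.0, 2.0), (2.0, 1.0), (3.0, -1.0), (0.5, 0.0),
        (1.5, 2.5), (2.5, 0.5), (3.5, 1.0), (0.25, -0.5)]
N, k, n = len(data), 4, 2  # N = k * n: 4 mini-batches of size 2

# Full-batch gradient: g = (1/N) * sum_i grad_i
g_full = sum(grad(w, x, t) for x, t in data) / N

# Accumulated gradient: average of the k per-mini-batch means
g_accum = sum(
    sum(grad(w, x, t) for x, t in data[j * n:(j + 1) * n]) / n
    for j in range(k)
) / k

assert abs(g_full - g_accum) < 1e-12  # equal up to float round-off
```

This is exactly why the implementations below divide the loss by the number of accumulation steps: each mini-batch contributes its mean gradient scaled by 1/k.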

Implementation

Basic PyTorch

import torch

model = MyModel()  # placeholder: substitute any torch.nn.Module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

accumulation_steps = 4
optimizer.zero_grad()  # start from clean gradients

for i, (inputs, targets) in enumerate(dataloader):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Normalize loss to account for accumulation
    loss = loss / accumulation_steps
    
    # Backward pass (accumulates gradients)
    loss.backward()
    
    # Update weights every accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

With Gradient Clipping

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        # Clip accumulated gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()

With Mixed Precision (AMP)

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accumulation_steps = 4

for i, (inputs, targets) in enumerate(dataloader):
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
    
    # Scale loss and backward
    scaler.scale(loss).backward()
    
    if (i + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)  # Unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Hugging Face Transformers

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # Effective batch = 8 × 4 = 32
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Important Considerations

1. Learning Rate Scaling

# Linear scaling rule (common for SGD)
base_lr = 0.01
base_batch_size = 32

# Example configuration (values are illustrative)
per_device_batch = 8
accumulation_steps = 4
num_gpus = 2
effective_batch_size = per_device_batch * accumulation_steps * num_gpus  # 64

scaled_lr = base_lr * (effective_batch_size / base_batch_size)  # 0.02

2. Batch Normalization

BatchNorm statistics are computed per mini-batch and are not accumulated, so accumulation is not exactly equivalent to a true large batch for models containing BatchNorm:

# Solution 1: Use SyncBatchNorm for multi-GPU
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Solution 2: Use LayerNorm or GroupNorm instead
# (statistics independent of batch size)
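A tiny pure-Python sketch of why this matters, using toy activation values for a single channel:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Toy activations for one channel: a full batch of 4, split into two mini-batches
data = [1.0, 2.0, 3.0, 10.0]
b1, b2 = data[:2], data[2:]

full_mean = mean(data)             # what a true large batch would normalize with
mini_means = (mean(b1), mean(b2))  # what each mini-batch actually normalizes with

# Each mini-batch uses its own statistics (here 1.5 and 6.5, vs. 4.0 for the
# full batch), so accumulating gradients over mini-batches is NOT equivalent
# to one large batch whenever the model contains BatchNorm layers.
assert full_mean == 4.0 and mini_means == (1.5, 6.5)
```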

3. Steps vs Epochs

# With accumulation, fewer optimizer steps per epoch
steps_per_epoch = len(dataloader) // accumulation_steps

# Adjust schedulers accordingly
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=epochs * steps_per_epoch,  # Not epochs * len(dataloader)!
)

4. Dropout and Augmentation

Each mini-batch gets different dropout masks and augmentations:

Batch 1: Dropout pattern A, Augmentation A
Batch 2: Dropout pattern B, Augmentation B
...
Accumulate → Update

This adds slight gradient noise, which is usually harmless and sometimes beneficial.

Memory Comparison

Batch size 64 directly:
  Memory ≈ 64 × per-sample activations + parameter gradients

Batch size 8 with accumulation=8:
  Memory ≈ 8 × per-sample activations + parameter gradients
  (Gradients are accumulated in place in the parameters' .grad buffers,
   not stored per sample, so the gradient memory is the same in both cases;
   the savings come entirely from the activations)
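The comparison can be made concrete with back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not measurements):

```python
# Assume ~50 MB of activations per sample and 100M fp32 parameters,
# whose gradients occupy one fixed buffer regardless of batch size.
act_per_sample_mb = 50
grad_buffer_mb = 100e6 * 4 / 1e6  # 400 MB of accumulated gradients

direct_64 = 64 * act_per_sample_mb + grad_buffer_mb  # 3600 MB
accum_8x8 = 8 * act_per_sample_mb + grad_buffer_mb   # 800 MB

assert direct_64 == 3600.0 and accum_8x8 == 800.0
```

Under these assumptions the accumulated version needs less than a quarter of the memory for the same effective batch size.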

Trade-offs

Aspect           Direct Large Batch    Gradient Accumulation
Memory           High                  Low
Speed            Fast                  Slower (more passes)
BN statistics    Full batch            Mini-batch only
Variance         Lower                 Slightly higher
Implementation   Simple                Requires care

When to Use

Use Gradient Accumulation When:

  • GPU memory is insufficient for desired batch size
  • Training large models (LLMs, ViT)
  • Need large batches for stability (contrastive learning)

Consider Alternatives When:

  • Memory allows direct large batches (faster)
  • Using batch-size-sensitive techniques (BatchNorm)
  • Training time is critical

Common Patterns

Pattern 1: Fixed Effective Batch Size

# Want effective batch of 256 across different GPUs
effective_batch = 256
per_device = 8
num_gpus = 4
accumulation = effective_batch // (per_device * num_gpus)  # = 8

Pattern 2: Dynamic Accumulation

# Increase batch size during training
for epoch in range(num_epochs):
    accum_steps = min(16, 2 ** (epoch // 2))  # 1, 1, 2, 2, 4, 4, 8, 8, 16, ...
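Written out, the schedule from the snippet above (with its cap of 16) produces:

```python
# Doubling every two epochs, capped at 16
schedule = [min(16, 2 ** (epoch // 2)) for epoch in range(10)]
print(schedule)  # [1, 1, 2, 2, 4, 4, 8, 8, 16, 16]
```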

Pattern 3: Handle Remainder Batches

for i, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accumulation_steps
    loss.backward()
    
    # Update at end of accumulation OR end of epoch
    if (i + 1) % accumulation_steps == 0 or (i + 1) == len(dataloader):
        optimizer.step()
        optimizer.zero_grad()

Key Takeaways

  1. Gradient accumulation simulates large batches with limited memory
  2. Divide loss by accumulation steps to maintain correct gradient scale
  3. Update weights only after accumulating all mini-batches
  4. Adjust learning rate schedulers for fewer steps per epoch
  5. BatchNorm statistics are computed per mini-batch (use alternatives if critical)
  6. Speed is slower than direct large batches but enables otherwise impossible training