Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes how each weight in the network should change to reduce the loss.
The Problem
To train a neural network with gradient descent, we need:
∂Loss/∂wᵢⱼ for every weight wᵢⱼ
Naively computing each gradient separately (for example, by perturbing one weight at a time and rerunning the network) would require a full forward pass per parameter. Backpropagation computes all of them in a single backward pass.
The Chain Rule
Backpropagation is just the chain rule applied systematically:
∂Loss/∂w = ∂Loss/∂output × ∂output/∂w
For deeper paths:
∂Loss/∂w = ∂Loss/∂L₃ × ∂L₃/∂L₂ × ∂L₂/∂L₁ × ∂L₁/∂w
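As a concrete sketch, here is the chain rule applied by hand to a tiny made-up composition (the functions f₁, f₂, f₃ below are invented for illustration, not from any particular network):

```python
# Hypothetical composition: L(w) = f3(f2(f1(w)))
# with f1(w) = 2w, f2(u) = u**2, f3(v) = v + 1, so L(w) = 4w**2 + 1.
w = 3.0

# Forward pass: compute and save intermediates
l1 = 2 * w        # f1
l2 = l1 ** 2      # f2
l3 = l2 + 1       # f3 (the "loss")

# Backward pass: multiply local derivatives along the path
dl3_dl2 = 1.0     # d(v + 1)/dv
dl2_dl1 = 2 * l1  # d(u**2)/du
dl1_dw = 2.0      # d(2w)/dw
dL_dw = dl3_dl2 * dl2_dl1 * dl1_dw

print(dL_dw)  # 24.0, matching the analytic derivative 8w at w = 3
```

Each factor is a cheap local derivative; the product recovers the end-to-end gradient without ever differentiating the whole composition symbolically.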
Forward Pass vs Backward Pass
Forward Pass
Compute outputs layer by layer:
x → h₁ = σ(W₁x) → h₂ = σ(W₂h₁) → ... → output → loss
Save intermediate values (activations) for backward pass.
Backward Pass
Compute gradients from output to input:
∂L/∂output → ∂L/∂h₂ → ∂L/∂h₁ → ∂L/∂W₁, ∂L/∂W₂, ...
Reuse intermediate results (gradients flow backward).
Step by Step Example
Consider a simple network: x → Linear → ReLU → Linear → Loss
Forward Pass
z₁ = W₁x + b₁
a₁ = ReLU(z₁)
z₂ = W₂a₁ + b₂
L = Loss(z₂, y)
Backward Pass
∂L/∂z₂ = ∂Loss/∂z₂ # Derivative of loss
∂L/∂W₂ = ∂L/∂z₂ × a₁ᵀ # Gradient for W₂
∂L/∂a₁ = W₂ᵀ × ∂L/∂z₂ # Gradient flows back
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁) # Through ReLU
∂L/∂W₁ = ∂L/∂z₁ × xᵀ # Gradient for W₁
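The equations above can be transcribed almost line for line into NumPy. This is a sketch, not a training loop; the shapes, the random data, and the choice of squared-error loss are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))   # input (shapes chosen for the demo)
y = rng.standard_normal((2, 1))   # target
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((2, 4)), np.zeros((2, 1))

# Forward pass (intermediates z1, a1, z2 are saved for the backward pass)
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)            # ReLU
z2 = W2 @ a1 + b2
L = 0.5 * np.sum((z2 - y) ** 2)   # squared-error loss (our choice)

# Backward pass, mirroring the equations above
dz2 = z2 - y          # dL/dz2 for squared error
dW2 = dz2 @ a1.T      # dL/dW2 = dL/dz2 * a1^T
da1 = W2.T @ dz2      # dL/da1 = W2^T * dL/dz2
dz1 = da1 * (z1 > 0)  # through ReLU: elementwise mask
dW1 = dz1 @ x.T       # dL/dW1 = dL/dz1 * x^T
db2, db1 = dz2, dz1   # bias gradients

# Finite-difference sanity check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ np.maximum(W1p @ x + b1, 0) + b2 - y) ** 2)
assert abs((Lp - L) / eps - dW1[0, 0]) < 1e-4
```

Note how each backward line reuses a quantity saved during the forward pass (a₁, z₁, x); that reuse is the whole trick.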
Why It's Efficient
Naive approach: perturb each weight and rerun the network, one forward pass per parameter. For n weights that is O(n) forward passes, roughly O(n²) work in total.
Backprop: compute each local gradient once and reuse it for everything upstream. One backward pass, with cost comparable to a single forward pass.
This efficiency is what makes training deep networks practical.
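To make the cost difference concrete, here is a sketch on a single linear layer with a squared-error loss (the layer, sizes, and data are all invented for the comparison). Finite differences need one perturbed forward pass per weight, while the backprop formula delivers every gradient at once:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))   # 100 weights
x = rng.standard_normal((10, 1))
y = rng.standard_normal((10, 1))

def loss(Wm):
    # squared-error loss of the linear layer Wm @ x against target y
    return 0.5 * np.sum((Wm @ x - y) ** 2)

# Backprop: one closed-form backward computation gives all 100 gradients
dW = (W @ x - y) @ x.T

# Naive finite differences: one extra forward pass per weight
eps = 1e-6
dW_fd = np.zeros_like(W)
n_forward = 0
for i in range(10):
    for j in range(10):
        Wp = W.copy()
        Wp[i, j] += eps
        dW_fd[i, j] = (loss(Wp) - loss(W)) / eps
        n_forward += 1

print(n_forward)                         # 100 forward passes for 100 weights
assert np.allclose(dW, dW_fd, atol=1e-3)
```

The gap widens with network size: a model with millions of parameters would need millions of forward passes the naive way, versus one backward pass with backprop.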
Computational Graphs
Modern frameworks (PyTorch, TensorFlow) build computational graphs:
x     W₁
 \   /
 MatMul
   |
  ReLU
   |
  Loss
Each node knows its local gradient. Backprop traverses the graph backward, multiplying gradients.
Gradients Through Common Operations
Linear Layer: y = Wx + b
∂L/∂W = ∂L/∂y × xᵀ
∂L/∂x = Wᵀ × ∂L/∂y
∂L/∂b = ∂L/∂y
Activation Functions
ReLU: ∂L/∂x = ∂L/∂y × (x > 0)
Sigmoid: ∂L/∂x = ∂L/∂y × σ(x)(1-σ(x))
Tanh: ∂L/∂x = ∂L/∂y × (1 - tanh²(x))
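These three rules can be checked numerically in a few lines (the inputs and the all-ones upstream gradient below are arbitrary choices for the demo):

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
dy = np.ones_like(x)                 # upstream gradient dL/dy (assumed)

relu_dx = dy * (x > 0)               # zero where x <= 0, pass-through otherwise
sig = 1 / (1 + np.exp(-x))
sigmoid_dx = dy * sig * (1 - sig)
tanh_dx = dy * (1 - np.tanh(x) ** 2)

# Verify the sigmoid rule against a finite difference at x[1] = 0.5
eps = 1e-6
num = (1 / (1 + np.exp(-(x[1] + eps))) - sig[1]) / eps
assert abs(num - sigmoid_dx[1]) < 1e-4
```

Note that sigmoid's local derivative peaks at 0.25, which is one reason sigmoid-heavy networks suffer from vanishing gradients (see below).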
Softmax + Cross-Entropy
∂L/∂logits = softmax(logits) - one_hot(label)
Beautifully simple when combined!
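A quick numerical check of that combined gradient (the logits and label are chosen arbitrarily):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])
label = 0

def ce_loss(z):
    # cross-entropy of softmax(z) against the label, numerically stabilized
    z = z - z.max()
    return -(z[label] - np.log(np.exp(z).sum()))

probs = np.exp(logits - logits.max())
probs /= probs.sum()
grad = probs - np.eye(3)[label]     # softmax(logits) - one_hot(label)

# Compare against finite differences on each logit
eps = 1e-6
for i in range(3):
    zp = logits.copy()
    zp[i] += eps
    assert abs((ce_loss(zp) - ce_loss(logits)) / eps - grad[i]) < 1e-4
```

A nice side effect of the combined form: the gradient entries always sum to zero, and each entry is bounded in [-1, 1], which keeps this layer numerically well behaved.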
Challenges
Vanishing Gradients
Gradients shrink exponentially in deep networks:
- Each layer multiplies the gradient by local derivatives that are often less than 1 (sigmoid's derivative, for instance, is at most 0.25)
- Earlier layers learn very slowly
Solutions: ReLU, residual connections, batch norm, better initialization.
Exploding Gradients
Gradients grow exponentially:
- Weights become unstable
- Loss goes to infinity
Solutions: Gradient clipping, proper initialization, normalization.
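A minimal sketch of gradient clipping by global norm, the most common of these fixes (the function name and threshold here are illustrative, not from any library):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm;
    # scaling jointly preserves the overall gradient direction.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0])]   # global norm 5
clipped = clip_by_global_norm(grads, max_norm=1.0)
# Combined norm of `clipped` is now at most 1.0; directions are unchanged.
```

Clipping per-parameter instead of globally would distort the update direction, which is why the global-norm variant is the usual default.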
Memory
Must store all activations for backward pass.
- Memory scales with batch size × depth × layer width
Solutions: Gradient checkpointing (recompute instead of store).
Automatic Differentiation
Modern frameworks implement autodiff:
# PyTorch example
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x**2 + 3*x
y.backward()   # Computes gradients automatically
print(x.grad)  # tensor([5.]) since dy/dx = 2x + 3 = 5 at x = 1
You define forward computation; framework handles backward.
Not Just for Neural Networks
Backprop works for any differentiable computation:
- Physics simulations
- Graphics rendering
- Control systems
- Probabilistic programming
Key Takeaways
- Backpropagation efficiently computes gradients for all weights
- It's the chain rule applied systematically from output to input
- Requires storing forward pass activations
- Vanishing/exploding gradients are key challenges
- Modern frameworks handle this automatically (autodiff)