Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes how each weight in the network should change to reduce the loss.
The Problem
To train a neural network with gradient descent, we need:
∂Loss/∂wᵢⱼ for every weight wᵢⱼ
Naively computing each gradient separately (for example, by perturbing one weight at a time and rerunning the network) would require a full forward pass per parameter. Backpropagation computes all of them in a single backward pass.
The Chain Rule
Backpropagation is just the chain rule applied systematically:
∂Loss/∂w = ∂Loss/∂output × ∂output/∂w
For deeper paths:
∂Loss/∂w = ∂Loss/∂L₃ × ∂L₃/∂L₂ × ∂L₂/∂L₁ × ∂L₁/∂w
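As a concrete sketch, here is the chain rule applied by hand to a tiny made-up composition (the functions f₁, f₂, f₃ below are invented for illustration, not from any particular network):

```python
# Hypothetical composition: L(w) = f3(f2(f1(w)))
# with f1(w) = 2w, f2(u) = u**2, f3(v) = v + 1, so L(w) = 4w**2 + 1.
w = 3.0

# Forward pass: compute and save intermediates
l1 = 2 * w        # f1
l2 = l1 ** 2      # f2
l3 = l2 + 1       # f3 (the "loss")

# Backward pass: multiply local derivatives along the path
dl3_dl2 = 1.0     # d(v + 1)/dv
dl2_dl1 = 2 * l1  # d(u**2)/du
dl1_dw = 2.0      # d(2w)/dw
dL_dw = dl3_dl2 * dl2_dl1 * dl1_dw

print(dL_dw)  # 24.0, matching the analytic derivative 8w at w = 3
```

Each factor is a cheap local derivative; the product recovers the end-to-end gradient without ever differentiating the whole composition symbolically.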
Forward Pass vs Backward Pass
Forward Pass
Compute outputs layer by layer:
x → h₁ = σ(W₁x) → h₂ = σ(W₂h₁) → ... → output → loss
Save intermediate values (activations) for backward pass.
Backward Pass
Compute gradients from output to input:
∂L/∂output → ∂L/∂h₂ → ∂L/∂h₁ → ∂L/∂W₁, ∂L/∂W₂, ...
Reuse intermediate results (gradients flow backward).
Step by Step Example
Consider a simple network: x → Linear → ReLU → Linear → Loss
Forward Pass
z₁ = W₁x + b₁
a₁ = ReLU(z₁)
z₂ = W₂a₁ + b₂
L = Loss(z₂, y)
Backward Pass
∂L/∂z₂ = ∂Loss/∂z₂ # Derivative of loss
∂L/∂W₂ = ∂L/∂z₂ × a₁ᵀ # Gradient for W₂
∂L/∂a₁ = W₂ᵀ × ∂L/∂z₂ # Gradient flows back
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁) # Through ReLU
∂L/∂W₁ = ∂L/∂z₁ × xᵀ # Gradient for W₁
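The equations above can be transcribed almost line for line into NumPy. This is a sketch, not a training loop; the shapes, the random data, and the choice of squared-error loss are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))   # input (shapes chosen for the demo)
y = rng.standard_normal((2, 1))   # target
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((2, 4)), np.zeros((2, 1))

# Forward pass (intermediates z1, a1, z2 are saved for the backward pass)
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)            # ReLU
z2 = W2 @ a1 + b2
L = 0.5 * np.sum((z2 - y) ** 2)   # squared-error loss (our choice)

# Backward pass, mirroring the equations above
dz2 = z2 - y          # dL/dz2 for squared error
dW2 = dz2 @ a1.T      # dL/dW2 = dL/dz2 * a1^T
da1 = W2.T @ dz2      # dL/da1 = W2^T * dL/dz2
dz1 = da1 * (z1 > 0)  # through ReLU: elementwise mask
dW1 = dz1 @ x.T       # dL/dW1 = dL/dz1 * x^T
db2, db1 = dz2, dz1   # bias gradients

# Finite-difference sanity check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ np.maximum(W1p @ x + b1, 0) + b2 - y) ** 2)
assert abs((Lp - L) / eps - dW1[0, 0]) < 1e-4
```

Note how each backward line reuses a quantity saved during the forward pass (a₁, z₁, x); that reuse is the whole trick.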
Why It's Efficient
Naive approach: perturb each weight and rerun the network, one forward pass per parameter. For n weights that is O(n) forward passes, roughly O(n²) work in total.
Backprop: compute each local gradient once and reuse it for everything upstream. One backward pass, with cost comparable to a single forward pass.
This efficiency is what makes training deep networks practical.
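To make the cost difference concrete, here is a sketch on a single linear layer with a squared-error loss (the layer, sizes, and data are all invented for the comparison). Finite differences need one perturbed forward pass per weight, while the backprop formula delivers every gradient at once:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))   # 100 weights
x = rng.standard_normal((10, 1))
y = rng.standard_normal((10, 1))

def loss(Wm):
    # squared-error loss of the linear layer Wm @ x against target y
    return 0.5 * np.sum((Wm @ x - y) ** 2)

# Backprop: one closed-form backward computation gives all 100 gradients
dW = (W @ x - y) @ x.T

# Naive finite differences: one extra forward pass per weight
eps = 1e-6
dW_fd = np.zeros_like(W)
n_forward = 0
for i in range(10):
    for j in range(10):
        Wp = W.copy()
        Wp[i, j] += eps
        dW_fd[i, j] = (loss(Wp) - loss(W)) / eps
        n_forward += 1

print(n_forward)                         # 100 forward passes for 100 weights
assert np.allclose(dW, dW_fd, atol=1e-3)
```

The gap widens with network size: a model with millions of parameters would need millions of forward passes the naive way, versus one backward pass with backprop.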
Computational Graphs
Modern frameworks (PyTorch, TensorFlow) build computational graphs:
x     W₁
 \   /
 MatMul
   |
  ReLU
   |
  Loss
Each node knows its local gradient. Backprop traverses the graph backward, multiplying gradients.
Gradients Through Common Operations
Linear Layer: y = Wx + b
∂L/∂W = ∂L/∂y × xᵀ
∂L/∂x = Wᵀ × ∂L/∂y
∂L/∂b = ∂L/∂y
Activation Functions
ReLU: ∂L/∂x = ∂L/∂y × (x > 0)
Sigmoid: ∂L/∂x = ∂L/∂y × σ(x)(1-σ(x))
Tanh: ∂L/∂x = ∂L/∂y × (1 - tanh²(x))
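These three rules can be checked numerically in a few lines (the inputs and the all-ones upstream gradient below are arbitrary choices for the demo):

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])
dy = np.ones_like(x)                 # upstream gradient dL/dy (assumed)

relu_dx = dy * (x > 0)               # zero where x <= 0, pass-through otherwise
sig = 1 / (1 + np.exp(-x))
sigmoid_dx = dy * sig * (1 - sig)
tanh_dx = dy * (1 - np.tanh(x) ** 2)

# Verify the sigmoid rule against a finite difference at x[1] = 0.5
eps = 1e-6
num = (1 / (1 + np.exp(-(x[1] + eps))) - sig[1]) / eps
assert abs(num - sigmoid_dx[1]) < 1e-4
```

Note that sigmoid's local derivative peaks at 0.25, which is one reason sigmoid-heavy networks suffer from vanishing gradients (see below).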
Softmax + Cross-Entropy
∂L/∂logits = softmax(logits) - one_hot(label)
Beautifully simple when combined!
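A quick numerical check of that combined gradient (the logits and label are chosen arbitrarily):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])
label = 0

def ce_loss(z):
    # cross-entropy of softmax(z) against the label, numerically stabilized
    z = z - z.max()
    return -(z[label] - np.log(np.exp(z).sum()))

probs = np.exp(logits - logits.max())
probs /= probs.sum()
grad = probs - np.eye(3)[label]     # softmax(logits) - one_hot(label)

# Compare against finite differences on each logit
eps = 1e-6
for i in range(3):
    zp = logits.copy()
    zp[i] += eps
    assert abs((ce_loss(zp) - ce_loss(logits)) / eps - grad[i]) < 1e-4
```

A nice side effect of the combined form: the gradient entries always sum to zero, and each entry is bounded in [-1, 1], which keeps this layer numerically well behaved.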
Challenges
Vanishing Gradients
Gradients shrink exponentially in deep networks:
- Each layer multiplies the gradient by local derivatives that are often less than 1 (sigmoid's derivative, for instance, is at most 0.25)
- Earlier layers learn very slowly
Solutions: ReLU, residual connections, batch norm, better initialization.
Exploding Gradients
Gradients grow exponentially:
- Weights become unstable
- Loss goes to infinity
Solutions: Gradient clipping, proper initialization, normalization.
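A minimal sketch of gradient clipping by global norm, the most common of these fixes (the function name and threshold here are illustrative, not from any library):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm;
    # scaling jointly preserves the overall gradient direction.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0])]   # global norm 5
clipped = clip_by_global_norm(grads, max_norm=1.0)
# Combined norm of `clipped` is now at most 1.0; directions are unchanged.
```

Clipping per-parameter instead of globally would distort the update direction, which is why the global-norm variant is the usual default.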
Memory
Must store all activations for backward pass.
- Memory scales with batch size × depth × layer width
Solutions: Gradient checkpointing (recompute instead of store).
Automatic Differentiation
Modern frameworks implement autodiff:
# PyTorch example
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x**2 + 3*x
y.backward()   # Computes gradients automatically
print(x.grad)  # tensor([5.]) since dy/dx = 2x + 3 = 5 at x = 1
You define forward computation; framework handles backward.
Not Just for Neural Networks
Backprop works for any differentiable computation:
- Physics simulations
- Graphics rendering
- Control systems
- Probabilistic programming
Key Takeaways
- Backpropagation efficiently computes gradients for all weights
- It's the chain rule applied systematically from output to input
- Requires storing forward pass activations
- Vanishing/exploding gradients are key challenges
- Modern frameworks handle this automatically (autodiff)