Intermediate · Deep Learning

Understand backpropagation - the algorithm that makes training neural networks possible by efficiently computing gradients.

optimization · neural-networks · gradients · training

Backpropagation

Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes how each weight in the network should change to reduce the loss.

The Problem

To train a neural network with gradient descent, we need:

∂Loss/∂wᵢⱼ for every weight wᵢⱼ

Naively computing each gradient separately would be incredibly slow. Backpropagation does it efficiently in one backward pass.
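To see why the naive approach is slow, consider estimating each gradient by finite differences: perturb one weight, rerun the whole network, repeat for every weight. A minimal NumPy sketch, using a toy squared-error loss on a single linear layer (all names here are illustrative, not from the text):

```python
import numpy as np

def loss(W, x, y):
    """Toy loss: squared error of a linear model."""
    return float(np.sum((W @ x - y) ** 2))

def naive_gradients(W, x, y, eps=1e-6):
    """Finite differences: one extra forward pass per weight."""
    grad = np.zeros_like(W)
    base = loss(W, x, y)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_pert = W.copy()
            W_pert[i, j] += eps
            grad[i, j] = (loss(W_pert, x, y) - base) / eps
    return grad  # n_weights forward passes for n_weights gradients

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = rng.standard_normal(3)

g = naive_gradients(W, x, y)
# Analytic gradient of sum((Wx - y)^2) w.r.t. W is 2(Wx - y)xᵀ
g_exact = 2 * np.outer(W @ x - y, x)
print(np.max(np.abs(g - g_exact)))  # tiny discrepancy
```

Each of the 12 weights here required its own forward pass; backpropagation gets all of them from a single backward sweep.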

The Chain Rule

Backpropagation is just the chain rule applied systematically:

∂Loss/∂w = ∂Loss/∂output × ∂output/∂w

For deeper paths:

∂Loss/∂w = ∂Loss/∂L₃ × ∂L₃/∂L₂ × ∂L₂/∂L₁ × ∂L₁/∂w
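The same multiplication of local derivatives can be checked numerically on a small composed chain (the particular functions below are illustrative):

```python
import math

# Composed chain: w -> L1 = w^2 -> L2 = sin(L1) -> Loss = L2^2
w = 0.7
L1 = w ** 2
L2 = math.sin(L1)
loss_val = L2 ** 2

# Local derivative of each link
dLoss_dL2 = 2 * L2
dL2_dL1 = math.cos(L1)
dL1_dw = 2 * w

# Chain rule: multiply the local derivatives along the path
dLoss_dw = dLoss_dL2 * dL2_dL1 * dL1_dw

# Compare against a finite-difference estimate
eps = 1e-6
loss_eps = math.sin((w + eps) ** 2) ** 2
print(abs(dLoss_dw - (loss_eps - loss_val) / eps))  # ~1e-6
```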

Forward Pass vs Backward Pass

Forward Pass

Compute outputs layer by layer:

x → h₁ = σ(W₁x) → h₂ = σ(W₂h₁) → ... → output → loss

Save intermediate values (activations) for backward pass.

Backward Pass

Compute gradients from output to input:

∂L/∂output → ∂L/∂h₂ → ∂L/∂h₁ → ..., computing ∂L/∂W₂, ∂L/∂W₁, ... along the way

Reuse intermediate results (gradients flow backward).

Step by Step Example

Consider a simple network: x → Linear → ReLU → Linear → Loss

Forward Pass

z₁ = W₁x + b₁
a₁ = ReLU(z₁)
z₂ = W₂a₁ + b₂
L = Loss(z₂, y)

Backward Pass

∂L/∂z₂ = ∂Loss/∂z₂           # Derivative of loss
∂L/∂W₂ = ∂L/∂z₂ × a₁ᵀ        # Gradient for W₂
∂L/∂a₁ = W₂ᵀ × ∂L/∂z₂        # Gradient flows back
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁)  # Through ReLU
∂L/∂W₁ = ∂L/∂z₁ × xᵀ         # Gradient for W₁
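The five backward-pass equations above translate almost line for line into NumPy. A sketch, assuming a squared-error loss (so ∂L/∂z₂ = z₂ − y) and random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))          # input (column vector)
y = rng.standard_normal((2, 1))          # target
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))

# Forward pass (save z1, a1 for the backward pass)
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0.0)                 # ReLU
z2 = W2 @ a1 + b2
L = 0.5 * float(np.sum((z2 - y) ** 2))   # squared-error loss

# Backward pass, mirroring the equations above
dz2 = z2 - y                             # ∂L/∂z2 for squared error
dW2 = dz2 @ a1.T                         # ∂L/∂W2 = ∂L/∂z2 · a1ᵀ
da1 = W2.T @ dz2                         # ∂L/∂a1 = W2ᵀ · ∂L/∂z2
dz1 = da1 * (z1 > 0)                     # through ReLU
dW1 = dz1 @ x.T                          # ∂L/∂W1 = ∂L/∂z1 · xᵀ

# Spot-check one entry of dW1 with finite differences
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
a1p = np.maximum(W1p @ x + b1, 0.0)
Lp = 0.5 * float(np.sum((W2 @ a1p + b2 - y) ** 2))
print(abs(dW1[0, 0] - (Lp - L) / eps))   # ~1e-6
```

Note how z₁ and a₁, saved during the forward pass, are exactly what the backward pass consumes.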

Why It's Efficient

Naive approach: For each weight, trace the full chain from that weight to the loss separately. With n weights, that is roughly one pass through the network per weight: O(n × path_length) operations.

Backprop: Compute each intermediate gradient once and reuse it for every weight that depends on it. One backward pass, costing about the same as the forward pass: O(n) operations.

This efficiency is what makes training deep networks practical.

Computational Graphs

Modern frameworks (PyTorch, TensorFlow) build computational graphs:

     x        W₁
      \      /
       MatMul
         |
        ReLU
         |
        Loss

Each node knows its local gradient. Backprop traverses the graph backward, multiplying gradients.
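A minimal sketch of this idea, with a hypothetical scalar `Node` class (real frameworks process the graph in topological order rather than by naive recursion, but the "local gradient × upstream gradient" rule is the same):

```python
class Node:
    """Minimal scalar autograd node (illustrative, not a real framework)."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # upstream Nodes
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream
        # Traverse backward, multiplying each local gradient by the upstream one
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

# y = x*x + 3*x  ->  dy/dx = 2x + 3 = 5 at x = 1
x = Node(1.0)
three = Node(3.0)
y = add(mul(x, x), mul(three, x))
y.backward()
print(x.grad)  # 5.0
```

Each operation records only its own local derivatives at construction time; the backward traversal strings them together with the chain rule.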

Gradients Through Common Operations

Linear Layer: y = Wx + b

∂L/∂W = ∂L/∂y × xᵀ
∂L/∂x = Wᵀ × ∂L/∂y
∂L/∂b = ∂L/∂y
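These three rules can be verified numerically. A sketch using a toy loss L = ½‖y‖², chosen so that ∂L/∂y = y:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
x = rng.standard_normal((4, 1))
b = rng.standard_normal((3, 1))

def loss(W, x, b):
    y = W @ x + b
    return 0.5 * float(np.sum(y ** 2))   # toy loss, so ∂L/∂y = y

y = W @ x + b
dy = y                                   # ∂L/∂y for this toy loss
dW = dy @ x.T                            # ∂L/∂W = ∂L/∂y · xᵀ
dx = W.T @ dy                            # ∂L/∂x = Wᵀ · ∂L/∂y
db = dy                                  # ∂L/∂b = ∂L/∂y

# Finite-difference check on one entry of dW and one of dx
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
xp = x.copy(); xp[2, 0] += eps
print(abs(dW[1, 2] - (loss(Wp, x, b) - loss(W, x, b)) / eps))
print(abs(dx[2, 0] - (loss(W, xp, b) - loss(W, x, b)) / eps))
```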

Activation Functions

ReLU:     ∂L/∂x = ∂L/∂y × (x > 0)
Sigmoid:  ∂L/∂x = ∂L/∂y × σ(x)(1-σ(x))
Tanh:     ∂L/∂x = ∂L/∂y × (1 - tanh²(x))
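In code, each of these is an elementwise multiply of the upstream gradient by the activation's derivative. A quick NumPy illustration (with ∂L/∂y set to ones for simplicity):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
upstream = np.ones_like(x)               # pretend ∂L/∂y = 1 everywhere

relu_grad = upstream * (x > 0)           # indicator: 1 where x > 0, else 0
sig = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = upstream * sig * (1.0 - sig)
tanh_grad = upstream * (1.0 - np.tanh(x) ** 2)

# Sanity check: the sigmoid derivative peaks at 0.25 at x = 0
print(sigmoid_grad[4])  # 0.25
```

Note that the sigmoid derivative never exceeds 0.25, a fact that matters for the vanishing-gradient discussion below.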

Softmax + Cross-Entropy

∂L/∂logits = softmax(logits) - one_hot(label)

Beautifully simple when combined!
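This combined gradient is easy to confirm with finite differences. A small sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(z, label):
    return -np.log(softmax(z)[label])

logits = np.array([2.0, 1.0, 0.1])
label = 0

# Combined gradient: softmax(logits) - one_hot(label)
one_hot = np.zeros_like(logits); one_hot[label] = 1.0
grad = softmax(logits) - one_hot

# Finite-difference check, one logit at a time
eps = 1e-6
approx = np.array([
    (cross_entropy(logits + eps * np.eye(3)[i], label)
     - cross_entropy(logits, label)) / eps
    for i in range(3)
])
print(np.max(np.abs(grad - approx)))     # ~1e-6
```

Differentiating softmax and cross-entropy separately involves a full Jacobian; combining them collapses everything to a single subtraction.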

Challenges

Vanishing Gradients

Gradients shrink exponentially in deep networks:

  • Each layer multiplies the gradient by its weight matrix and activation derivative, factors that are often below 1 (the sigmoid derivative is at most 0.25)
  • Earlier layers learn very slowly

Solutions: ReLU, residual connections, batch norm, better initialization.
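The shrinkage is easy to demonstrate: push a gradient backward through a stack of sigmoid layers and watch its norm collapse (the depth, width, and weight scale below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 30, 64

# Backward pass through sigmoid layers: at each layer the gradient
# is multiplied by Wᵀ · diag(σ'(z)), and σ' never exceeds 0.25.
grad = np.ones(width)
norms = []
for _ in range(depth):
    W = rng.standard_normal((width, width)) * 0.1
    z = rng.standard_normal(width)
    sig = 1.0 / (1.0 + np.exp(-z))
    grad = W.T @ (grad * sig * (1.0 - sig))
    norms.append(float(np.linalg.norm(grad)))

print(norms[0], norms[-1])  # the gradient norm collapses toward zero
```

With ReLU (derivative exactly 1 on the active half) or residual connections (an additive identity path), this per-layer shrink factor largely disappears.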

Exploding Gradients

Gradients grow exponentially:

  • Weights become unstable
  • Loss goes to infinity

Solutions: Gradient clipping, proper initialization, normalization.
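Gradient clipping by global norm is a few lines. A sketch (function name and shapes are illustrative; PyTorch provides this as `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global norm is <= max_norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# Pretend these exploded during the backward pass
grads = [np.full((3, 3), 100.0), np.full(3, 100.0)]
clipped, before = clip_by_global_norm(grads, max_norm=1.0)
after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(before, after)  # huge norm rescaled down to 1.0
```

Clipping preserves the gradient's direction while bounding the step size, which is why it stabilizes training without changing which way the weights move.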

Memory

Must store all activations for backward pass.

  • Memory scales with batch size × depth

Solutions: Gradient checkpointing (recompute instead of store).
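The recompute-instead-of-store trade can be sketched in a few lines: keep only a boundary activation ("checkpoint"), discard the inner ones, and regenerate them when the backward pass needs them (toy layer stack, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) * 0.5 for _ in range(6)]

def forward_segment(h, Ws):
    for W in Ws:
        h = np.maximum(W @ h, 0.0)       # Linear + ReLU
    return h

x = rng.standard_normal(8)

# Plain backprop would store every activation. With checkpointing,
# store only the segment boundary and recompute the rest later.
checkpoint = forward_segment(x, layers[:3])   # stored boundary activation
out = forward_segment(checkpoint, layers[3:]) # inner activations discarded

# Backward pass for the second segment: recompute from the checkpoint
acts = [checkpoint]
for W in layers[3:]:
    acts.append(np.maximum(W @ acts[-1], 0.0))
print(np.allclose(acts[-1], out))  # recomputation matches the forward pass
```

This trades one extra forward pass per segment for memory that scales with the number of checkpoints rather than the full depth; PyTorch exposes the technique as `torch.utils.checkpoint`.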

Automatic Differentiation

Modern frameworks implement autodiff:

# PyTorch example
x = torch.tensor([1.0], requires_grad=True)
y = x**2 + 3*x
y.backward()  # Computes gradients automatically
print(x.grad)  # dy/dx = 2x + 3 = 5

You define forward computation; framework handles backward.

Not Just for Neural Networks

Backprop works for any differentiable computation:

  • Physics simulations
  • Graphics rendering
  • Control systems
  • Probabilistic programming

Key Takeaways

  1. Backpropagation efficiently computes gradients for all weights
  2. It's the chain rule applied systematically from output to input
  3. Requires storing forward pass activations
  4. Vanishing/exploding gradients are key challenges
  5. Modern frameworks handle this automatically (autodiff)