Residual Connections (Skip Connections)
Residual connections revolutionized deep learning by enabling training of networks with hundreds or thousands of layers. They're now a standard component in modern architectures.
The Problem with Deep Networks
Vanishing Gradients
In deep networks, gradients shrink through layers:
∂L/∂w₁ = ∂L/∂h₁₀ × ∂h₁₀/∂h₉ × ... × ∂h₂/∂h₁ × ∂h₁/∂w₁
If every factor has magnitude below 1, the product shrinks exponentially with depth and the gradient vanishes.
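As a toy numeric sketch (the 0.5 per-layer factor is illustrative, not measured), multiplying identical sub-unit factors shows how quickly the product decays with depth:

```python
def gradient_magnitude(per_layer_grad: float, depth: int) -> float:
    """Product of `depth` identical per-layer gradient factors."""
    return per_layer_grad ** depth

shallow = gradient_magnitude(0.5, 5)    # 5 layers
deep = gradient_magnitude(0.5, 50)      # 50 layers
print(f"5 layers:  {shallow:.6f}")      # 0.031250
print(f"50 layers: {deep:.2e}")         # ~8.9e-16: effectively zero
```

In a real network the per-layer factors vary, but the geometric decay is the same mechanism.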
Degradation Problem
Surprisingly, deeper networks performed worse than shallower ones (even on training data). This wasn't overfitting—it was optimization difficulty.
The Residual Solution
Basic Idea
Instead of learning H(x), learn the residual F(x) = H(x) - x:
H(x) = F(x) + x
output = F(x) + x
= transformation + identity
Visual Representation
        ┌─────────────(skip)─────────────┐
        │                                ↓
x ──────┴──→ [Conv] → [ReLU] → [Conv] ─→(+)─→ ReLU → output
             └───────────F(x)───────────┘
Why It Works
1. Gradient Highway
During backpropagation:
∂L/∂x = ∂L/∂output × (∂F/∂x + 1)

The "+1" comes from the identity path: even if ∂F/∂x is tiny, the gradient still passes through at full strength.
The gradient can flow directly through the skip connection.
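A minimal autograd check of this (illustrative scalar setup, where a single weight `w` plays the role of ∂F/∂x): the same linear map with and without a skip shows the extra identity term in the gradient:

```python
import torch

# Illustrative scalar setup: w plays the role of dF/dx.
x1 = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(0.1)

(grad_plain,) = torch.autograd.grad(w * x1, x1)     # plain layer: d(out)/dx = w

x2 = torch.tensor(1.0, requires_grad=True)
(grad_res,) = torch.autograd.grad(w * x2 + x2, x2)  # with skip: d(out)/dx = w + 1

print(grad_plain.item(), grad_res.item())  # ≈ 0.1 vs ≈ 1.1
```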
2. Easy to Learn Identity
If optimal is near identity:
- Without skip: Network must learn H(x) = x (hard)
- With skip: Network must learn F(x) = 0 (easy)
3. Ensemble Effect
Unrolling the skip connections shows that an n-block ResNet sums over 2ⁿ paths of varying depth, so it behaves like an implicit ensemble of shallower networks.
Types of Skip Connections
Identity Shortcut
output = F(x) + x
Requires F(x) and x to have same dimensions.
Projection Shortcut
When dimensions don't match:
output = F(x) + W_s × x
Use 1×1 convolution to match dimensions.
Pre-activation ResNet
Batch norm and activation before convolution:
x → BN → ReLU → Conv → BN → ReLU → Conv → (+) →
Better gradient flow, commonly used.
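A sketch of such a pre-activation block in PyTorch (class name and channel sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice, then add."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # no activation after the add: the identity path stays clean

x = torch.randn(2, 16, 8, 8)
y = PreActBlock(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Note the difference from the post-activation block below: nothing is applied after the addition, so the identity path is a pure sum all the way through the network.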
Architectures Using Skip Connections
ResNet
The original, stacked residual blocks:
Conv → [Res Block] × N → [Res Block] × N → ... → FC
DenseNet
Within a dense block, each layer receives the concatenated features of all preceding layers:
x₁ = H₁(x₀)
x₂ = H₂([x₀, x₁])
x₃ = H₃([x₀, x₁, x₂])
Maximum feature reuse.
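A minimal sketch of this concatenation pattern (`DenseLayer` and the channel sizes are illustrative; the full DenseNet adds bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer: reads the concatenation of all earlier feature maps
    and emits `growth` new channels."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth, 3, padding=1)

    def forward(self, features):
        return self.conv(torch.cat(features, dim=1))

growth = 4
x0 = torch.randn(1, 8, 16, 16)
features = [x0]
for i in range(3):
    in_ch = 8 + i * growth                  # input channel count grows each layer
    features.append(DenseLayer(in_ch, growth)(features))

print(torch.cat(features, dim=1).shape)  # 8 + 3*4 = 20 channels
```

Where ResNet combines paths by addition, DenseNet combines them by concatenation, so the channel count grows linearly with depth.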
U-Net
Skip connections between encoder and decoder:
Encoder                 Decoder
 [E1] ────────────────→ [D1]
   ↓                      ↑
 [E2] ────────────────→ [D2]
   ↓                      ↑
 [E3] ────────────────→ [D3]
   ↓                      ↑
   └───── Bottleneck ─────┘
The skips carry fine spatial detail from encoder to decoder, which is critical for segmentation.
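One such encoder-decoder skip can be sketched as follows (shapes and layer choices are illustrative; U-Net skips concatenate rather than add):

```python
import torch
import torch.nn as nn

# One U-Net skip: upsample the deeper features, concatenate the matching
# encoder map, then convolve.
enc_feat = torch.randn(1, 32, 64, 64)    # encoder output at this resolution
deeper = torch.randn(1, 64, 32, 32)      # features from one level down

up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
dec = up(deeper)                              # -> (1, 32, 64, 64)
merged = torch.cat([dec, enc_feat], dim=1)    # skip by channel concatenation
out = nn.Conv2d(64, 32, 3, padding=1)(merged)
print(out.shape)  # torch.Size([1, 32, 64, 64])
```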
Transformer
Residual + LayerNorm around attention and FFN:
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
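The two equations above map directly onto a pre-LayerNorm block; a sketch using PyTorch's built-in attention (dimensions and class name are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm transformer block matching the two equations above."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around FFN
        return x

x = torch.randn(2, 10, 64)          # (batch, seq, d_model)
y = TransformerBlock(64, 4)(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

Because every sublayer is wrapped this way, the token representations flow through the whole stack along an unbroken additive path, the same gradient highway as in ResNet.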
Implementation
Basic Residual Block
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual  # skip connection
        return F.relu(out)
With Projection
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Projection shortcut if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)
        return F.relu(out)
Variations and Extensions
Stochastic Depth
Randomly drop residual blocks during training:
output = x + b × F(x)   # b ~ Bernoulli(p): b=1 keeps the block, b=0 skips it
Acts as regularization and shortens the average depth, speeding up training.
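A wrapper sketch of this idea (names are illustrative; implementations differ in whether they rescale at training or eval time, and the survival probability is usually scheduled per depth):

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Wrap a residual branch F: during training, drop it with prob 1 - p;
    at eval time, scale it by its survival probability p."""
    def __init__(self, branch, p=0.8):
        super().__init__()
        self.branch = branch
        self.p = p  # survival probability

    def forward(self, x):
        if self.training:
            if torch.rand(()).item() < self.p:
                return x + self.branch(x)   # block survives this step
            return x                        # block dropped: pure identity
        return x + self.p * self.branch(x)  # deterministic expected value

layer = StochasticDepth(nn.Linear(8, 8), p=0.5).eval()
x = torch.randn(3, 8)
out = layer(x)
print(out.shape)  # torch.Size([3, 8])
```

Dropping a block is safe precisely because of the skip: the identity path keeps the signal intact.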
Squeeze-and-Excitation
Gate the residual branch by learned per-channel importance:
output = x + s ⊙ F(x)   # s = SE(F(x)), a per-channel scale in [0, 1]
ReZero
Learnable scaling initialized to zero:
output = x + α × F(x) # α starts at 0
Faster early training.
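A sketch of the ReZero pattern (class name illustrative):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual branch scaled by a learnable alpha initialized to zero,
    so the whole block starts as an exact identity."""
    def __init__(self, branch):
        super().__init__()
        self.branch = branch
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.branch(x)

block = ReZeroBlock(nn.Linear(4, 4))
x = torch.randn(2, 4)
print(torch.allclose(block(x), x))  # True: alpha = 0 at init
```

Since each block starts as the identity, very deep stacks begin training as a shallow network and gradually "switch on" their residual branches as the alphas grow.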
Impact and Results
Enabled Depth
| Network | Layers | Top-5 Error |
|---|---|---|
| VGG-19 | 19 | 9.0% |
| ResNet-34 | 34 | 5.7% |
| ResNet-152 | 152 | 4.5% |
| ResNet-1001 | 1001 | Trains successfully (pre-activation variant, CIFAR-10) |
Now Everywhere
- Computer vision: ResNet, EfficientNet
- NLP: Transformers, BERT, GPT
- Speech: Conformer
- RL: Many deep RL architectures
Key Takeaways
- Skip connections add input to output: y = F(x) + x
- Enables gradient flow in very deep networks
- Makes learning identity mapping easy
- Foundation of ResNet, Transformers, U-Net
- Almost always beneficial; there is rarely a reason to omit them
- Combine with normalization for best results