Residual Connections (Skip Connections)
Residual connections revolutionized deep learning by enabling training of networks with hundreds or thousands of layers. They're now a standard component in modern architectures.
The Problem with Deep Networks
Vanishing Gradients
In deep networks, gradients shrink through layers:
∂L/∂w₁ = ∂L/∂h₁₀ × ∂h₁₀/∂h₉ × ... × ∂h₂/∂h₁ × ∂h₁/∂w₁
If every factor has magnitude below 1, the product shrinks exponentially with depth and the gradient vanishes.
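As a toy numeric sketch (the 0.5 per-layer factor is illustrative, not measured), multiplying identical sub-unit factors shows how quickly the product decays with depth:

```python
def gradient_magnitude(per_layer_grad: float, depth: int) -> float:
    """Product of `depth` identical per-layer gradient factors."""
    return per_layer_grad ** depth

shallow = gradient_magnitude(0.5, 5)    # 5 layers
deep = gradient_magnitude(0.5, 50)      # 50 layers
print(f"5 layers:  {shallow:.6f}")      # 0.031250
print(f"50 layers: {deep:.2e}")         # ~8.9e-16: effectively zero
```

In a real network the per-layer factors vary, but the geometric decay is the same mechanism.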
Degradation Problem
Surprisingly, deeper networks performed worse than shallower ones (even on training data). This wasn't overfitting—it was optimization difficulty.
The Residual Solution
Basic Idea
Instead of learning H(x), learn the residual F(x) = H(x) - x:
H(x) = F(x) + x
output = F(x) + x
= transformation + identity
Visual Representation
        ┌─────────────(skip)─────────────┐
        │                                ↓
x ──────┴──→ [Conv] → [ReLU] → [Conv] ─→(+)─→ ReLU → output
             └───────────F(x)───────────┘
Why It Works
1. Gradient Highway
During backpropagation:
∂L/∂x = ∂L/∂output × (∂F/∂x + 1)

The "+1" comes from the identity path: even if ∂F/∂x is tiny, the gradient still passes through at full strength.
The gradient can flow directly through the skip connection.
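A minimal autograd check of this (illustrative scalar setup, where a single weight `w` plays the role of ∂F/∂x): the same linear map with and without a skip shows the extra identity term in the gradient:

```python
import torch

# Illustrative scalar setup: w plays the role of dF/dx.
x1 = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(0.1)

(grad_plain,) = torch.autograd.grad(w * x1, x1)     # plain layer: d(out)/dx = w

x2 = torch.tensor(1.0, requires_grad=True)
(grad_res,) = torch.autograd.grad(w * x2 + x2, x2)  # with skip: d(out)/dx = w + 1

print(grad_plain.item(), grad_res.item())  # ≈ 0.1 vs ≈ 1.1
```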
2. Easy to Learn Identity
If optimal is near identity:
- Without skip: Network must learn H(x) = x (hard)
- With skip: Network must learn F(x) = 0 (easy)
3. Ensemble Effect
Unrolling the skip connections shows that an n-block ResNet sums over 2ⁿ paths of varying depth, so it behaves like an implicit ensemble of shallower networks.
Types of Skip Connections
Identity Shortcut
output = F(x) + x
Requires F(x) and x to have same dimensions.
Projection Shortcut
When dimensions don't match:
output = F(x) + W_s × x
Use 1×1 convolution to match dimensions.
Pre-activation ResNet
Batch norm and activation before convolution:
x → BN → ReLU → Conv → BN → ReLU → Conv → (+) →
Better gradient flow, commonly used.
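A sketch of such a pre-activation block in PyTorch (class name and channel sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice, then add."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # no activation after the add: the identity path stays clean

x = torch.randn(2, 16, 8, 8)
y = PreActBlock(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Note the difference from the post-activation block below: nothing is applied after the addition, so the identity path is a pure sum all the way through the network.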
Architectures Using Skip Connections
ResNet
The original, stacked residual blocks:
Conv → [Res Block] × N → [Res Block] × N → ... → FC
DenseNet
Within a dense block, each layer receives the concatenated features of all preceding layers:
x₁ = H₁(x₀)
x₂ = H₂([x₀, x₁])
x₃ = H₃([x₀, x₁, x₂])
Maximum feature reuse.
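A minimal sketch of this concatenation pattern (`DenseLayer` and the channel sizes are illustrative; the full DenseNet adds bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer: reads the concatenation of all earlier feature maps
    and emits `growth` new channels."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth, 3, padding=1)

    def forward(self, features):
        return self.conv(torch.cat(features, dim=1))

growth = 4
x0 = torch.randn(1, 8, 16, 16)
features = [x0]
for i in range(3):
    in_ch = 8 + i * growth                  # input channel count grows each layer
    features.append(DenseLayer(in_ch, growth)(features))

print(torch.cat(features, dim=1).shape)  # 8 + 3*4 = 20 channels
```

Where ResNet combines paths by addition, DenseNet combines them by concatenation, so the channel count grows linearly with depth.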
U-Net
Skip connections between encoder and decoder:
Encoder                 Decoder
 [E1] ────────────────→ [D1]
   ↓                      ↑
 [E2] ────────────────→ [D2]
   ↓                      ↑
 [E3] ────────────────→ [D3]
   ↓                      ↑
   └───── Bottleneck ─────┘
The skips carry fine spatial detail from encoder to decoder, which is critical for segmentation.
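One such encoder-decoder skip can be sketched as follows (shapes and layer choices are illustrative; U-Net skips concatenate rather than add):

```python
import torch
import torch.nn as nn

# One U-Net skip: upsample the deeper features, concatenate the matching
# encoder map, then convolve.
enc_feat = torch.randn(1, 32, 64, 64)    # encoder output at this resolution
deeper = torch.randn(1, 64, 32, 32)      # features from one level down

up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
dec = up(deeper)                              # -> (1, 32, 64, 64)
merged = torch.cat([dec, enc_feat], dim=1)    # skip by channel concatenation
out = nn.Conv2d(64, 32, 3, padding=1)(merged)
print(out.shape)  # torch.Size([1, 32, 64, 64])
```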
Transformer
Residual + LayerNorm around attention and FFN:
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
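The two equations above map directly onto a pre-LayerNorm block; a sketch using PyTorch's built-in attention (dimensions and class name are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm transformer block matching the two equations above."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.ln2(x))                      # residual around FFN
        return x

x = torch.randn(2, 10, 64)          # (batch, seq, d_model)
y = TransformerBlock(64, 4)(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

Because every sublayer is wrapped this way, the token representations flow through the whole stack along an unbroken additive path, the same gradient highway as in ResNet.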
Implementation
Basic Residual Block
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual  # skip connection
        return F.relu(out)
With Projection
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Projection shortcut if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)
        return F.relu(out)
Variations and Extensions
Stochastic Depth
Randomly drop residual blocks during training:
output = x + b × F(x)   # b ~ Bernoulli(p): b=1 keeps the block, b=0 skips it
Acts as regularization and shortens the average depth, speeding up training.
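A wrapper sketch of this idea (names are illustrative; implementations differ in whether they rescale at training or eval time, and the survival probability is usually scheduled per depth):

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Wrap a residual branch F: during training, drop it with prob 1 - p;
    at eval time, scale it by its survival probability p."""
    def __init__(self, branch, p=0.8):
        super().__init__()
        self.branch = branch
        self.p = p  # survival probability

    def forward(self, x):
        if self.training:
            if torch.rand(()).item() < self.p:
                return x + self.branch(x)   # block survives this step
            return x                        # block dropped: pure identity
        return x + self.p * self.branch(x)  # deterministic expected value

layer = StochasticDepth(nn.Linear(8, 8), p=0.5).eval()
x = torch.randn(3, 8)
out = layer(x)
print(out.shape)  # torch.Size([3, 8])
```

Dropping a block is safe precisely because of the skip: the identity path keeps the signal intact.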
Squeeze-and-Excitation
Gate the residual branch by learned per-channel importance:
output = x + s ⊙ F(x)   # s = SE(F(x)), a per-channel scale in [0, 1]
ReZero
Learnable scaling initialized to zero:
output = x + α × F(x) # α starts at 0
Faster early training.
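A sketch of the ReZero pattern (class name illustrative):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual branch scaled by a learnable alpha initialized to zero,
    so the whole block starts as an exact identity."""
    def __init__(self, branch):
        super().__init__()
        self.branch = branch
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.branch(x)

block = ReZeroBlock(nn.Linear(4, 4))
x = torch.randn(2, 4)
print(torch.allclose(block(x), x))  # True: alpha = 0 at init
```

Since each block starts as the identity, very deep stacks begin training as a shallow network and gradually "switch on" their residual branches as the alphas grow.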
Impact and Results
Enabled Depth
| Network | Layers | Top-5 Error |
|---|---|---|
| VGG-19 | 19 | 9.0% |
| ResNet-34 | 34 | 5.7% |
| ResNet-152 | 152 | 4.5% |
| ResNet-1001 | 1001 | Trains successfully (pre-activation variant, CIFAR-10) |
Now Everywhere
- Computer vision: ResNet, EfficientNet
- NLP: Transformers, BERT, GPT
- Speech: Conformer
- RL: Many deep RL architectures
Key Takeaways
- Skip connections add input to output: y = F(x) + x
- Enables gradient flow in very deep networks
- Makes learning identity mapping easy
- Foundation of ResNet, Transformers, U-Net
- Almost always beneficial; there is rarely a reason to omit them
- Combine with normalization for best results