Dropout
Dropout is a regularization technique that randomly "drops" neurons during training. Despite its simplicity, it's remarkably effective at preventing overfitting in neural networks.
How It Works
During Training
Randomly set each neuron's output to zero with probability p:
mask = random_binary(shape, p=dropout_rate)  # 1 with probability 1 - p, 0 with probability p
output = input × mask / (1 - p)              # scale to maintain the expected value
Typical dropout rates: 0.2-0.5
During Inference
Use all neurons (no dropout); because of the scaling applied during training, outputs already have the correct expected value.
With p = 0.5 and a sampled mask of [1, 0, 1, 0]:
Training: [1, 2, 3, 4] × [1, 0, 1, 0] / (1 - 0.5) = [2, 0, 6, 0]
Inference: [1, 2, 3, 4] (no change)
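The mechanics above can be sketched in NumPy (a minimal illustration, not a library API; passing a fixed mask here just reproduces the worked example, while a real layer would sample the mask):

```python
import numpy as np

def inverted_dropout(x, p, rng=None, mask=None):
    """Zero each element with probability p, then rescale by 1/(1-p)."""
    if mask is None:
        rng = rng or np.random.default_rng()
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)

x = np.array([1.0, 2.0, 3.0, 4.0])
out = inverted_dropout(x, p=0.5, mask=np.array([1.0, 0.0, 1.0, 0.0]))
print(out)  # [2. 0. 6. 0.]
```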
Why Does It Work?
1. Prevents Co-adaptation
Neurons can't rely on specific other neurons being present, which forces the network to learn distributed representations.
2. Ensemble Effect
Training with dropout ≈ training exponentially many "thinned" networks. At test time, we use an averaged model.
3. Implicit Regularization
Adds noise to training, similar to L2 regularization but adaptive.
4. Redundancy
The network must learn redundant representations, which makes it more robust.
Inverted Dropout (Modern Standard)
Original dropout: Scale at test time
Train: output = input × mask
Test: output = input × (1 - p)
Inverted dropout: Scale at train time
Train: output = input × mask / (1 - p)
Test: output = input (no change!)
Inverted is better because:
- Same code path at train and test
- No scaling needed at inference
- Easier to implement
Where to Apply Dropout
Fully Connected Layers
Most common placement:
Linear → ReLU → Dropout → Linear → ...
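A minimal NumPy sketch of that placement pattern (layer sizes and initialization are hypothetical; dropout goes after the nonlinearity, and the output layer gets none):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5

def linear(x, W, b):
    return x @ W + b

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, p, training=True):
    if not training:
        return x  # inverted dropout: identity at inference
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

x = rng.standard_normal((8, 16))           # batch of 8, 16 features
W1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 4)), np.zeros(4)

h = dropout(relu(linear(x, W1, b1)), p)    # Linear → ReLU → Dropout
y = linear(h, W2, b2)                      # no dropout on the output layer
print(y.shape)  # (8, 4)
```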
The Output Layer
Don't apply dropout to the output layer itself!
Convolutional Layers
- Less common in modern CNNs
- Spatial dropout: Drop entire feature maps
- Often replaced by batch normalization
Recurrent Networks
- Standard dropout between layers
- Special handling needed within recurrent connections
- Variational dropout: Same mask across time steps
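The variational-dropout idea can be sketched with a toy recurrence (the tanh update stands in for a real RNN cell; the key point is that one mask is sampled per sequence and reused at every step):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
T, batch, hidden = 5, 2, 8
x = rng.standard_normal((T, batch, hidden))

# Variational dropout: sample ONE mask per sequence and reuse it at every step
mask = (rng.random((batch, hidden)) >= p) / (1.0 - p)

h = np.zeros((batch, hidden))
states = []
for t in range(T):
    h = np.tanh(x[t] + h) * mask   # same mask applied at every time step
    states.append(h)

# The zeroed hidden units are the same at every time step
```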
Choosing Dropout Rate
| Setting | Typical Rate |
|---|---|
| Input layer | 0.1 - 0.2 |
| Hidden layers | 0.2 - 0.5 |
| Large networks | Higher (0.5+) |
| Small networks | Lower (0.1 - 0.3) |
| Conv layers | 0 - 0.2 |
Rule of thumb: Start with 0.5, adjust based on validation.
Dropout vs Other Regularization
| Technique | How it regularizes |
|---|---|
| Dropout | Ensemble + noise |
| L2 | Shrinks weights |
| L1 | Sparsifies weights |
| Batch Norm | Normalizes activations |
| Data Augmentation | More diverse data |
Often used together! But dropout + batch norm interaction can be tricky.
Variants
Spatial Dropout
Drop entire feature maps in CNNs:
mask shape: (batch, channels, 1, 1) # Same for all spatial positions
Preserves spatial structure.
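A NumPy sketch of spatial dropout (shapes are illustrative; the per-channel mask broadcasts over the spatial dimensions, so each feature map is dropped or kept as a whole):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.standard_normal((2, 4, 3, 3))           # (batch, channels, H, W)

# One Bernoulli draw per (sample, channel); broadcast over H and W
mask = (rng.random((2, 4, 1, 1)) >= p) / (1.0 - p)
out = x * mask

# Each feature map is either entirely zeroed or entirely kept (and scaled)
```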
DropConnect
Drop weights instead of activations:
output = (W × mask) × input
More fine-grained but more expensive.
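As a sketch, DropConnect masks the weight matrix rather than the activations (toy shapes; the 1/(1-p) rescaling mirrors inverted dropout):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
W = rng.standard_normal((4, 3))     # weight matrix
x = rng.standard_normal(3)          # input vector

# DropConnect: drop individual WEIGHTS, not activations
mask = (rng.random(W.shape) >= p) / (1.0 - p)
output = (W * mask) @ x
print(output.shape)  # (4,)
```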
DropBlock
Drop contiguous regions in feature maps:
- Forces network to look at different parts of image
- Often more effective than dropping individual activations independently in conv layers
Alpha Dropout
For SELU activations, maintains self-normalizing property.
Attention Dropout
Drop attention weights in Transformers.
Modern Usage
In Transformers
MultiHeadAttention → Dropout
FeedForward → Dropout
(dropout is applied to each sublayer's output before the residual addition)
In CNNs
Less common now - batch normalization provides similar regularization.
In Practice
# PyTorch
self.dropout = nn.Dropout(p=0.5)  # define in your nn.Module's __init__
# Remember to call model.train() and model.eval() to toggle dropout!

# TensorFlow/Keras
layer = tf.keras.layers.Dropout(rate=0.5)
# training=True/False is handled automatically by fit() and predict()
Common Mistakes
- Forgetting eval mode: Dropout at test time kills performance
- Too much dropout: Network can't learn
- Dropout before batch norm: Interaction can hurt
- Same rate everywhere: Tune per layer type
Monte Carlo Dropout
Cool trick: Keep dropout ON at test time, run multiple times.
predictions = np.stack([model(x, training=True) for _ in range(N)])
mean = predictions.mean(axis=0)  # prediction
std = predictions.std(axis=0)    # uncertainty estimate!
Provides uncertainty estimates without Bayesian neural networks.
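A self-contained NumPy sketch of the idea (a toy one-layer linear model; in a real framework you would instead keep the dropout layers in training mode at inference):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 1))
x = rng.standard_normal(16)
p, N = 0.5, 100

def stochastic_forward(x):
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # dropout stays ON
    return ((x * mask) @ W).item()

predictions = np.array([stochastic_forward(x) for _ in range(N)])
mean = predictions.mean()   # prediction
std = predictions.std()     # uncertainty estimate
```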
Key Takeaways
- Dropout randomly zeros neurons during training
- Prevents co-adaptation and acts as ensemble
- Use inverted dropout (scale at train time)
- Start with p=0.5, tune from there
- Less common in modern CNNs (batch norm instead)
- Still important in Transformers and FC networks