Dropout
Dropout is a regularization technique that randomly "drops" neurons during training. Despite its simplicity, it's remarkably effective at preventing overfitting in neural networks.
How It Works
During Training
Randomly set each neuron's output to zero with probability p:
mask = random_binary(shape, p=dropout_rate)  # 1 with probability 1 - p, 0 with probability p
output = input × mask / (1 - p)              # scale to maintain the expected value
Typical dropout rates: 0.2-0.5
During Inference
Use all neurons (no dropout); because of the scaling applied during training, outputs already have the correct expected value.
With p = 0.5 and a sampled mask of [1, 0, 1, 0]:
Training: [1, 2, 3, 4] × [1, 0, 1, 0] / (1 - 0.5) = [2, 0, 6, 0]
Inference: [1, 2, 3, 4] (no change)
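The mechanics above can be sketched in NumPy (a minimal illustration, not a library API; passing a fixed mask here just reproduces the worked example, while a real layer would sample the mask):

```python
import numpy as np

def inverted_dropout(x, p, rng=None, mask=None):
    """Zero each element with probability p, then rescale by 1/(1-p)."""
    if mask is None:
        rng = rng or np.random.default_rng()
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)

x = np.array([1.0, 2.0, 3.0, 4.0])
out = inverted_dropout(x, p=0.5, mask=np.array([1.0, 0.0, 1.0, 0.0]))
print(out)  # [2. 0. 6. 0.]
```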
Why Does It Work?
1. Prevents Co-adaptation
Neurons can't rely on specific other neurons being present, which forces the network to learn distributed representations.
2. Ensemble Effect
Training with dropout ≈ training exponentially many "thinned" networks. At test time, we use an averaged model.
3. Implicit Regularization
Adds noise to training, similar to L2 regularization but adaptive.
4. Redundancy
The network must learn redundant representations, which makes it more robust.
Inverted Dropout (Modern Standard)
Original dropout: Scale at test time
Train: output = input × mask
Test: output = input × (1 - p)
Inverted dropout: Scale at train time
Train: output = input × mask / (1 - p)
Test: output = input (no change!)
Inverted is better because:
- Same code path at train and test
- No scaling needed at inference
- Easier to implement
Where to Apply Dropout
Fully Connected Layers
Most common placement:
Linear → ReLU → Dropout → Linear → ...
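A minimal NumPy sketch of that placement pattern (layer sizes and initialization are hypothetical; dropout goes after the nonlinearity, and the output layer gets none):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5

def linear(x, W, b):
    return x @ W + b

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, p, training=True):
    if not training:
        return x  # inverted dropout: identity at inference
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

x = rng.standard_normal((8, 16))           # batch of 8, 16 features
W1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 4)), np.zeros(4)

h = dropout(relu(linear(x, W1, b1)), p)    # Linear → ReLU → Dropout
y = linear(h, W2, b2)                      # no dropout on the output layer
print(y.shape)  # (8, 4)
```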
The Output Layer
Don't apply dropout to the output layer itself!
Convolutional Layers
- Less common in modern CNNs
- Spatial dropout: Drop entire feature maps
- Often replaced by batch normalization
Recurrent Networks
- Standard dropout between layers
- Special handling needed within recurrent connections
- Variational dropout: Same mask across time steps
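The variational-dropout idea can be sketched with a toy recurrence (the tanh update stands in for a real RNN cell; the key point is that one mask is sampled per sequence and reused at every step):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
T, batch, hidden = 5, 2, 8
x = rng.standard_normal((T, batch, hidden))

# Variational dropout: sample ONE mask per sequence and reuse it at every step
mask = (rng.random((batch, hidden)) >= p) / (1.0 - p)

h = np.zeros((batch, hidden))
states = []
for t in range(T):
    h = np.tanh(x[t] + h) * mask   # same mask applied at every time step
    states.append(h)

# The zeroed hidden units are the same at every time step
```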
Choosing Dropout Rate
| Setting | Typical Rate |
|---|---|
| Input layer | 0.1 - 0.2 |
| Hidden layers | 0.2 - 0.5 |
| Large networks | Higher (0.5+) |
| Small networks | Lower (0.1 - 0.3) |
| Conv layers | 0 - 0.2 |
Rule of thumb: Start with 0.5, adjust based on validation.
Dropout vs Other Regularization
| Technique | How it regularizes |
|---|---|
| Dropout | Ensemble + noise |
| L2 | Shrinks weights |
| L1 | Sparsifies weights |
| Batch Norm | Normalizes activations |
| Data Augmentation | More diverse data |
Often used together! But dropout + batch norm interaction can be tricky.
Variants
Spatial Dropout
Drop entire feature maps in CNNs:
mask shape: (batch, channels, 1, 1) # Same for all spatial positions
Preserves spatial structure.
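A NumPy sketch of spatial dropout (shapes are illustrative; the per-channel mask broadcasts over the spatial dimensions, so each feature map is dropped or kept as a whole):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.standard_normal((2, 4, 3, 3))           # (batch, channels, H, W)

# One Bernoulli draw per (sample, channel); broadcast over H and W
mask = (rng.random((2, 4, 1, 1)) >= p) / (1.0 - p)
out = x * mask

# Each feature map is either entirely zeroed or entirely kept (and scaled)
```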
DropConnect
Drop weights instead of activations:
output = (W × mask) × input
More fine-grained but more expensive.
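As a sketch, DropConnect masks the weight matrix rather than the activations (toy shapes; the 1/(1-p) rescaling mirrors inverted dropout):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
W = rng.standard_normal((4, 3))     # weight matrix
x = rng.standard_normal(3)          # input vector

# DropConnect: drop individual WEIGHTS, not activations
mask = (rng.random(W.shape) >= p) / (1.0 - p)
output = (W * mask) @ x
print(output.shape)  # (4,)
```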
DropBlock
Drop contiguous regions in feature maps:
- Forces network to look at different parts of image
- Often more effective than dropping individual activations independently in conv layers
Alpha Dropout
For SELU activations, maintains self-normalizing property.
Attention Dropout
Drop attention weights in Transformers.
Modern Usage
In Transformers
MultiHeadAttention → Dropout
FeedForward → Dropout
(dropout is applied to each sublayer's output before the residual addition)
In CNNs
Less common now - batch normalization provides similar regularization.
In Practice
# PyTorch
self.dropout = nn.Dropout(p=0.5)  # define in your nn.Module's __init__
# Remember to call model.train() and model.eval() to toggle dropout!

# TensorFlow/Keras
layer = tf.keras.layers.Dropout(rate=0.5)
# training=True/False is handled automatically by fit() and predict()
Common Mistakes
- Forgetting eval mode: Dropout at test time kills performance
- Too much dropout: Network can't learn
- Dropout before batch norm: Interaction can hurt
- Same rate everywhere: Tune per layer type
Monte Carlo Dropout
Cool trick: Keep dropout ON at test time, run multiple times.
predictions = np.stack([model(x, training=True) for _ in range(N)])
mean = predictions.mean(axis=0)  # prediction
std = predictions.std(axis=0)    # uncertainty estimate!
Provides uncertainty estimates without Bayesian neural networks.
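A self-contained NumPy sketch of the idea (a toy one-layer linear model; in a real framework you would instead keep the dropout layers in training mode at inference):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 1))
x = rng.standard_normal(16)
p, N = 0.5, 100

def stochastic_forward(x):
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # dropout stays ON
    return ((x * mask) @ W).item()

predictions = np.array([stochastic_forward(x) for _ in range(N)])
mean = predictions.mean()   # prediction
std = predictions.std()     # uncertainty estimate
```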
Key Takeaways
- Dropout randomly zeros neurons during training
- Prevents co-adaptation and acts as ensemble
- Use inverted dropout (scale at train time)
- Start with p=0.5, tune from there
- Less common in modern CNNs (batch norm instead)
- Still important in Transformers and FC networks