Pooling Layers
Pooling layers reduce the spatial dimensions (width and height) of feature maps while retaining the most important information, reducing computation and providing translation invariance.
Why Pooling?
Without Pooling:
Input: 224×224 → Conv → 224×224 → Conv → 224×224 ...
Feature maps stay at full resolution, so compute and memory costs stay high
With Pooling:
Input: 224×224 → Conv → 224×224 → Pool → 112×112 → Conv → Pool → 56×56
Progressively smaller
Benefits
- Dimensionality reduction: Fewer activations and less computation, faster training
- Translation invariance: Small shifts don't change output
- Noise reduction: Averages out small variations
- Larger receptive field: Each neuron "sees" more of the input
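Translation invariance can be seen directly in a toy example (plain Python, no framework): a feature that shifts by one pixel but stays inside the same 2×2 pooling window produces an identical pooled output.

```python
def max_pool_2x2(grid):
    # Non-overlapping 2×2 max pooling over a list-of-lists "feature map"
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

a = [[0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
b = [[9, 0, 0, 0],   # same feature, shifted one pixel left
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

print(max_pool_2x2(a) == max_pool_2x2(b))  # True: both pool to [[9, 0], [0, 0]]
```

Note the invariance only holds for shifts within a window; larger shifts can still change the output.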
Types of Pooling
Max Pooling
Takes the maximum value in each window:
Input (4×4): Max Pool (2×2, stride 2):
┌───┬───┬───┬───┐ ┌───┬───┐
│ 1 │ 3 │ 2 │ 1 │ │ 4 │ 6 │
├───┼───┼───┼───┤ → ├───┼───┤
│ 4 │ 2 │ 6 │ 4 │ │ 8 │ 9 │
├───┼───┼───┼───┤ └───┴───┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤ Output: 2×2
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘
Top-left window [1,3,4,2] → max = 4
Top-right window [2,1,6,4] → max = 6
Properties:
- Preserves strongest activations (detected features)
- Good for sparse features
- Most common in CNNs
Average Pooling
Takes the average value in each window:
Input (4×4): Avg Pool (2×2, stride 2):
┌───┬───┬───┬───┐ ┌─────┬─────┐
│ 1 │ 3 │ 2 │ 1 │ │ 2.5 │ 3.25│
├───┼───┼───┼───┤ → ├─────┼─────┤
│ 4 │ 2 │ 6 │ 4 │ │ 4.0 │ 4.75│
├───┼───┼───┼───┤ └─────┴─────┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘
Top-left window [1,3,4,2] → avg = 2.5
Properties:
- Smoother output
- Good for dense features
- Often used at the end of the network (as global average pooling)
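Both worked examples above can be checked in a few lines of plain Python (a standalone sketch, no framework needed):

```python
x = [[1, 3, 2, 1],
     [4, 2, 6, 4],
     [5, 8, 3, 2],
     [2, 1, 9, 5]]

def pool_2x2(grid, op):
    # Apply `op` (max, or an averaging function) to each
    # non-overlapping 2×2 window of a 4×4 grid.
    return [
        [op([grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]])
         for j in range(0, 4, 2)]
        for i in range(0, 4, 2)
    ]

avg = lambda w: sum(w) / len(w)
print(pool_2x2(x, max))  # [[4, 6], [8, 9]]
print(pool_2x2(x, avg))  # [[2.5, 3.25], [4.0, 4.75]]
```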
Global Average Pooling (GAP)
Averages entire feature map to single value:
Input: 7×7×512 feature maps
↓
GAP: Average each 7×7 map to 1 value
↓
Output: 1×1×512
# Then flatten to [512] for classification
Advantages:
- No learned parameters
- Reduces overfitting
- Enables variable input sizes
- Used in modern architectures (ResNet, EfficientNet)
Implementation
PyTorch
import torch
import torch.nn as nn
# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 32, 32) # [batch, channels, height, width]
out = max_pool(x) # [1, 64, 16, 16]
# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out = avg_pool(x) # [1, 64, 16, 16]
# Global Average Pooling
gap = nn.AdaptiveAvgPool2d(1) # Output size 1×1
out = gap(x) # [1, 64, 1, 1]
out = out.view(out.size(0), -1) # [1, 64] - flatten
TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(64, 3, activation='relu'),
MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Conv2D(128, 3, activation='relu'),
MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Conv2D(256, 3, activation='relu'),
GlobalAveragePooling2D(), # Replace flatten + dense
tf.keras.layers.Dense(10, activation='softmax')
])
Pooling Parameters
Kernel Size
2×2, stride 2 (most common): Halves spatial dimensions
3×3, stride 3: Reduces spatial dimensions to one-third
Stride
Stride = kernel_size (non-overlapping): Standard
Stride < kernel_size (overlapping): More information preserved
Example with 3×3 kernel:
Stride 3: 9×9 → 3×3 (non-overlapping)
Stride 2: 9×9 → 4×4 (overlapping)
Padding
# Same padding: output same size as input
nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
# Valid padding (no padding): output smaller
nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
Output Size Calculation
Output = floor((Input + 2×Padding - Kernel) / Stride) + 1
Example:
Input: 32×32
Kernel: 2×2
Stride: 2
Padding: 0
Output = floor((32 + 0 - 2) / 2) + 1 = 16
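The formula translates directly into a small helper (a standalone sketch covering the examples in this section):

```python
import math

def pool_output_size(n, kernel, stride, padding=0):
    # Output = floor((Input + 2*Padding - Kernel) / Stride) + 1
    return math.floor((n + 2 * padding - kernel) / stride) + 1

print(pool_output_size(32, kernel=2, stride=2))             # 16
print(pool_output_size(9, kernel=3, stride=3))              # 3  (non-overlapping)
print(pool_output_size(9, kernel=3, stride=2))              # 4  (overlapping)
print(pool_output_size(32, kernel=3, stride=1, padding=1))  # 32 ("same" padding)
```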
Pooling vs Strided Convolution
| Aspect | Pooling | Strided Conv |
|---|---|---|
| Parameters | None | Learned |
| Flexibility | Fixed operation | Learns downsampling |
| Computation | Very fast | Slower |
| Translation invariance | Strong (max pool) | Weaker |
Modern trend: Strided convolutions sometimes replace pooling.
# Traditional
nn.Conv2d(64, 128, kernel_size=3, padding=1)
nn.MaxPool2d(2, 2)
# Modern alternative
nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1) # Learns to downsample
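The two variants produce the same output shape, which can be checked directly (a sketch assuming PyTorch; weights are random, this is only a shape comparison):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Traditional: convolution at full resolution, then 2×2 max pooling
traditional = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.MaxPool2d(2, 2),
)

# Modern alternative: the convolution itself downsamples via stride 2
strided = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(traditional(x).shape)  # torch.Size([1, 128, 16, 16])
print(strided(x).shape)      # torch.Size([1, 128, 16, 16])
```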
Special Pooling Types
L2 Pooling
import torch
import torch.nn.functional as F

def l2_pool(x, kernel_size=2, stride=2):
    # Square root of the sum of squares in each window
    x_squared = x ** 2
    summed = F.avg_pool2d(x_squared, kernel_size, stride) * (kernel_size ** 2)
    return torch.sqrt(summed)
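A quick sanity check (the function is repeated with its imports so the snippet runs on its own): a 2×2 window containing 3 and 4 should pool to the L2 norm √(3² + 4²) = 5.

```python
import torch
import torch.nn.functional as F

def l2_pool(x, kernel_size=2, stride=2):
    # sqrt of the sum of squares over each window
    summed = F.avg_pool2d(x ** 2, kernel_size, stride) * (kernel_size ** 2)
    return torch.sqrt(summed)

x = torch.tensor([[[[3.0, 4.0],
                    [0.0, 0.0]]]])  # one 2×2 window: values 3, 4, 0, 0
print(l2_pool(x))  # tensor([[[[5.]]]]) since sqrt(9 + 16 + 0 + 0) = 5
```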
Spatial Pyramid Pooling (SPP)
Input: Variable size
↓
┌──────────────────────────────┐
│ Pool to 4×4 Pool to 2×2 Pool to 1×1 │
└──────────────────────────────┘
↓
Concatenate: 16 + 4 + 1 = 21 features per channel
↓
Fixed-size output regardless of input size
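A minimal SPP sketch, assuming PyTorch (`AdaptiveMaxPool2d` pools to a fixed output grid regardless of input size):

```python
import torch
import torch.nn as nn

def spp(x, levels=(4, 2, 1)):
    # Pool each feature map to fixed 4×4, 2×2 and 1×1 grids,
    # then concatenate: 16 + 4 + 1 = 21 features per channel.
    pooled = [nn.AdaptiveMaxPool2d(n)(x).flatten(1) for n in levels]
    return torch.cat(pooled, dim=1)

# Two different input sizes map to the same output length
a = spp(torch.randn(1, 8, 13, 17))
b = spp(torch.randn(1, 8, 32, 32))
print(a.shape, b.shape)  # both torch.Size([1, 168]): 8 channels × 21
```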
Pooling in Different Architectures
LeNet / AlexNet
Conv → Pool → Conv → Pool → FC → FC
(2×2 max pooling with stride 2)
VGG
Conv → Conv → Pool → Conv → Conv → Pool → ...
(2×2 max pooling after conv blocks)
ResNet
Conv 7×7 stride 2 → Max Pool 3×3 stride 2 → ... → GAP → FC
(Initial downsampling, then GAP at end)
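The ResNet stem above can be sketched in PyTorch to verify the shapes (weights here are random; this is only a shape check, not a full ResNet):

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # 224 -> 112
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 112 -> 56
)
gap = nn.AdaptiveAvgPool2d(1)  # GAP applied at the end of the network

x = torch.randn(1, 3, 224, 224)
features = stem(x)
print(features.shape)       # torch.Size([1, 64, 56, 56])
print(gap(features).shape)  # torch.Size([1, 64, 1, 1])
```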
Modern CNNs
Often use strided convolutions instead of pooling
Global Average Pooling before final classifier
Key Takeaways
- Pooling reduces spatial dimensions while preserving features
- Max pooling keeps the strongest activations and is the most common choice
- Average pooling provides smoother features
- Global Average Pooling replaces fully connected layers
- Output size = floor((input + 2×padding - kernel) / stride) + 1
- Modern networks sometimes use strided convolutions instead