
Learn about pooling layers in CNNs - how they reduce spatial dimensions while preserving important features.


Pooling Layers

Pooling layers reduce the spatial dimensions (width and height) of feature maps while retaining the most important information, reducing computation and providing translation invariance.

Why Pooling?

Without Pooling:
Input: 224×224 → Conv → 224×224 → Conv → 224×224 ...
                       Many parameters, computationally expensive

With Pooling:
Input: 224×224 → Conv → 224×224 → Pool → 112×112 → Conv → Pool → 56×56
                                         Progressively smaller

Benefits

  1. Dimensionality reduction: Fewer parameters, faster training
  2. Translation invariance: Small shifts don't change output
  3. Noise reduction: Averages out small variations
  4. Larger receptive field: Each neuron "sees" more of the input
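The translation-invariance point can be seen in a small sketch: a one-pixel shift that stays inside the same pooling window leaves the max-pooled output unchanged (shifts that cross a window boundary can still change it, so the invariance is only approximate).

```python
import torch
import torch.nn.functional as F

# A feature map with a single strong activation at (1, 1)
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 1, 1] = 1.0

# The same activation shifted one pixel left, to (1, 0) --
# still inside the same top-left 2x2 pooling window
x_shifted = torch.zeros(1, 1, 4, 4)
x_shifted[0, 0, 1, 0] = 1.0

out = F.max_pool2d(x, kernel_size=2, stride=2)
out_shifted = F.max_pool2d(x_shifted, kernel_size=2, stride=2)
print(torch.equal(out, out_shifted))  # True: the pooled outputs match
```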

Types of Pooling

Max Pooling

Takes the maximum value in each window:

Input (4×4):               Max Pool (2×2, stride 2):
┌───┬───┬───┬───┐         ┌───┬───┐
│ 1 │ 3 │ 2 │ 1 │         │ 4 │ 6 │
├───┼───┼───┼───┤   →     ├───┼───┤
│ 4 │ 2 │ 6 │ 4 │         │ 8 │ 9 │
├───┼───┼───┼───┤         └───┴───┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤         Output: 2×2
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘

Top-left window [1,3,4,2] → max = 4
Top-right window [2,1,6,4] → max = 6
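The worked example above can be checked directly with `F.max_pool2d`:

```python
import torch
import torch.nn.functional as F

# The 4x4 input from the diagram, shaped [batch, channels, height, width]
x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 6., 4.],
                  [5., 8., 3., 2.],
                  [2., 1., 9., 5.]]).reshape(1, 1, 4, 4)

out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.squeeze())
# tensor([[4., 6.],
#         [8., 9.]])
```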

Properties:

  • Preserves strongest activations (detected features)
  • Good for sparse features
  • Most common in CNNs

Average Pooling

Takes the average value in each window:

Input (4×4):               Avg Pool (2×2, stride 2):
┌───┬───┬───┬───┐         ┌─────┬─────┐
│ 1 │ 3 │ 2 │ 1 │         │ 2.5 │ 3.25│
├───┼───┼───┼───┤   →     ├─────┼─────┤
│ 4 │ 2 │ 6 │ 4 │         │ 4.0 │ 4.75│
├───┼───┼───┼───┤         └─────┴─────┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘

Top-left window [1,3,4,2] → avg = 2.5
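Running the same 4×4 input through `F.avg_pool2d` reproduces the averaged values:

```python
import torch
import torch.nn.functional as F

# Same 4x4 input as the max-pooling example
x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 6., 4.],
                  [5., 8., 3., 2.],
                  [2., 1., 9., 5.]]).reshape(1, 1, 4, 4)

out = F.avg_pool2d(x, kernel_size=2, stride=2)
print(out.squeeze())
# tensor([[2.5000, 3.2500],
#         [4.0000, 4.7500]])
```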

Properties:

  • Smoother output
  • Good for dense features
  • Used in later layers (global average pooling)

Global Average Pooling (GAP)

Averages entire feature map to single value:

Input: 7×7×512 feature maps
           ↓
GAP: Average each 7×7 map to 1 value
           ↓
Output: 1×1×512

# Then flatten to [512] for classification

Advantages:

  • No learned parameters
  • Reduces overfitting
  • Enables variable input sizes
  • Used in modern architectures (ResNet, EfficientNet)
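The variable-input-size advantage follows from `nn.AdaptiveAvgPool2d` always producing the requested output grid; a quick sketch:

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)

# Different spatial sizes, same channel count -> same output shape
for h, w in [(7, 7), (14, 14), (9, 13)]:
    x = torch.randn(1, 512, h, w)
    print(gap(x).shape)  # torch.Size([1, 512, 1, 1]) every time
```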

Implementation

PyTorch

import torch
import torch.nn as nn

# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 32, 32)  # [batch, channels, height, width]
out = max_pool(x)  # [1, 64, 16, 16]

# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out = avg_pool(x)  # [1, 64, 16, 16]

# Global Average Pooling
gap = nn.AdaptiveAvgPool2d(1)  # Output size 1×1
out = gap(x)  # [1, 64, 1, 1]
out = out.view(out.size(0), -1)  # [1, 64] - flatten

TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    tf.keras.layers.Conv2D(256, 3, activation='relu'),
    GlobalAveragePooling2D(),  # Replace flatten + dense
    
    tf.keras.layers.Dense(10, activation='softmax')
])

Pooling Parameters

Kernel Size

2×2 (most common): Halves spatial dimensions
3×3 with stride 3: Divides spatial dimensions by 3

Stride

Stride = kernel_size (non-overlapping): Standard
Stride < kernel_size (overlapping): More information preserved

Example with 3×3 kernel:
Stride 3: 9×9 → 3×3 (non-overlapping)
Stride 2: 9×9 → 4×4 (overlapping)
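The two stride settings can be verified by checking output shapes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 9, 9)

# Non-overlapping: 3x3 windows, stride 3
print(F.max_pool2d(x, kernel_size=3, stride=3).shape)  # torch.Size([1, 1, 3, 3])

# Overlapping: 3x3 windows, stride 2
print(F.max_pool2d(x, kernel_size=3, stride=2).shape)  # torch.Size([1, 1, 4, 4])
```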

Padding

# Same padding: output same size as input
nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# Valid padding (no padding): output smaller
nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

Output Size Calculation

Output = floor((Input + 2×Padding - Kernel) / Stride) + 1

Example:
Input: 32×32
Kernel: 2×2
Stride: 2
Padding: 0

Output = floor((32 + 0 - 2) / 2) + 1 = 16
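The formula is easy to wrap in a small helper (`pool_output_size` is a hypothetical name, not a library function):

```python
def pool_output_size(input_size, kernel, stride, padding=0):
    """Compute floor((input + 2*padding - kernel) / stride) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1

print(pool_output_size(32, kernel=2, stride=2))  # 16
print(pool_output_size(9, kernel=3, stride=2))   # 4  (the overlapping example above)
```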

Pooling vs Strided Convolution

| Aspect                 | Pooling           | Strided Conv        |
|------------------------|-------------------|---------------------|
| Parameters             | None              | Learned             |
| Flexibility            | Fixed operation   | Learns downsampling |
| Computation            | Very fast         | Slower              |
| Translation invariance | Strong (max pool) | Weaker              |

Modern trend: Strided convolutions sometimes replace pooling.

# Traditional
nn.Conv2d(64, 128, kernel_size=3, padding=1)
nn.MaxPool2d(2, 2)

# Modern alternative
nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # Learns to downsample
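Both routes above halve the spatial dimensions; a quick shape check (channel counts here are just the ones used in the snippet):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Traditional: conv keeps 32x32, pooling halves it
traditional = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.MaxPool2d(2, 2),
)

# Modern alternative: the conv itself downsamples
modern = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(traditional(x).shape)  # torch.Size([1, 128, 16, 16])
print(modern(x).shape)       # torch.Size([1, 128, 16, 16])
```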

Special Pooling Types

L2 Pooling

# Square root of sum of squares over each window
import torch
import torch.nn.functional as F

def l2_pool(x, kernel_size=2, stride=2):
    x_squared = x ** 2
    # avg_pool2d times the window area recovers the sum over each window
    summed = F.avg_pool2d(x_squared, kernel_size, stride) * (kernel_size ** 2)
    return torch.sqrt(summed)

Spatial Pyramid Pooling (SPP)

Input: Variable size
       ↓
┌───────────────────────────────────────┐
│ Pool to 4×4  Pool to 2×2  Pool to 1×1 │
└───────────────────────────────────────┘
       ↓
Concatenate: 16 + 4 + 1 = 21 features per channel
       ↓
Fixed-size output regardless of input size
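A minimal SPP sketch can be built from PyTorch's adaptive pooling layers (`spatial_pyramid_pool` is a hypothetical helper, not a library function):

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(x, levels=(4, 2, 1)):
    """Pool each feature map to fixed grids and concatenate.

    With the default levels this gives 16 + 4 + 1 = 21 features
    per channel, regardless of the input's spatial size.
    """
    batch = x.size(0)
    pooled = [nn.AdaptiveMaxPool2d(n)(x).view(batch, -1) for n in levels]
    return torch.cat(pooled, dim=1)

# Two different input sizes produce the same output size: 256 * 21 = 5376
for size in (32, 57):
    x = torch.randn(1, 256, size, size)
    print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 5376])
```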

Pooling in Different Architectures

LeNet / AlexNet

Conv → Pool → Conv → Pool → FC → FC
(LeNet used 2×2 average pooling; AlexNet used overlapping 3×3 max pooling with stride 2)

VGG

Conv → Conv → Pool → Conv → Conv → Pool → ...
(2×2 max pooling after conv blocks)

ResNet

Conv 7×7 stride 2 → Max Pool 3×3 stride 2 → ... → GAP → FC
(Initial downsampling, then GAP at end)

Modern CNNs

  • Often use strided convolutions instead of pooling
  • Global Average Pooling before the final classifier

Key Takeaways

  1. Pooling reduces spatial dimensions while preserving features
  2. Max pooling keeps strongest activations, most common choice
  3. Average pooling provides smoother features
  4. Global Average Pooling replaces fully connected layers
  5. Output size = floor((input + 2×padding - kernel) / stride) + 1
  6. Modern networks sometimes use strided convolutions instead