
Learn about pooling layers in CNNs - how they reduce spatial dimensions while preserving important features.


Pooling Layers

Pooling layers reduce the spatial dimensions (width and height) of feature maps while retaining the most important information, reducing computation and providing translation invariance.

Why Pooling?

Without Pooling:
Input: 224×224 → Conv → 224×224 → Conv → 224×224 ...
                       Many parameters, computationally expensive

With Pooling:
Input: 224×224 → Conv → 224×224 → Pool → 112×112 → Conv → Pool → 56×56
                                         Progressively smaller

Benefits

  1. Dimensionality reduction: Fewer parameters, faster training
  2. Translation invariance: Small shifts don't change output
  3. Noise reduction: Averages out small variations
  4. Larger receptive field: Each neuron "sees" more of the input
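The translation-invariance point can be seen in a small sketch: a one-pixel shift that stays inside the same pooling window leaves the max-pooled output unchanged (shifts that cross a window boundary can still change it, so the invariance is only approximate).

```python
import torch
import torch.nn.functional as F

# A feature map with a single strong activation at (1, 1)
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 1, 1] = 1.0

# The same activation shifted one pixel left, to (1, 0) --
# still inside the same top-left 2x2 pooling window
x_shifted = torch.zeros(1, 1, 4, 4)
x_shifted[0, 0, 1, 0] = 1.0

out = F.max_pool2d(x, kernel_size=2, stride=2)
out_shifted = F.max_pool2d(x_shifted, kernel_size=2, stride=2)
print(torch.equal(out, out_shifted))  # True: the pooled outputs match
```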

Types of Pooling

Max Pooling

Takes the maximum value in each window:

Input (4×4):               Max Pool (2×2, stride 2):
┌───┬───┬───┬───┐         ┌───┬───┐
│ 1 │ 3 │ 2 │ 1 │         │ 4 │ 6 │
├───┼───┼───┼───┤   →     ├───┼───┤
│ 4 │ 2 │ 6 │ 4 │         │ 8 │ 9 │
├───┼───┼───┼───┤         └───┴───┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤         Output: 2×2
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘

Top-left window [1,3,4,2] → max = 4
Top-right window [2,1,6,4] → max = 6
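The worked example above can be checked directly with `F.max_pool2d`:

```python
import torch
import torch.nn.functional as F

# The 4x4 input from the diagram, shaped [batch, channels, height, width]
x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 6., 4.],
                  [5., 8., 3., 2.],
                  [2., 1., 9., 5.]]).reshape(1, 1, 4, 4)

out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.squeeze())
# tensor([[4., 6.],
#         [8., 9.]])
```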

Properties:

  • Preserves strongest activations (detected features)
  • Good for sparse features
  • Most common in CNNs

Average Pooling

Takes the average value in each window:

Input (4×4):               Avg Pool (2×2, stride 2):
┌───┬───┬───┬───┐         ┌─────┬─────┐
│ 1 │ 3 │ 2 │ 1 │         │ 2.5 │ 3.25│
├───┼───┼───┼───┤   →     ├─────┼─────┤
│ 4 │ 2 │ 6 │ 4 │         │ 4.0 │ 4.75│
├───┼───┼───┼───┤         └─────┴─────┘
│ 5 │ 8 │ 3 │ 2 │
├───┼───┼───┼───┤
│ 2 │ 1 │ 9 │ 5 │
└───┴───┴───┴───┘

Top-left window [1,3,4,2] → avg = 2.5
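Running the same 4×4 input through `F.avg_pool2d` reproduces the averaged values:

```python
import torch
import torch.nn.functional as F

# Same 4x4 input as the max-pooling example
x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 6., 4.],
                  [5., 8., 3., 2.],
                  [2., 1., 9., 5.]]).reshape(1, 1, 4, 4)

out = F.avg_pool2d(x, kernel_size=2, stride=2)
print(out.squeeze())
# tensor([[2.5000, 3.2500],
#         [4.0000, 4.7500]])
```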

Properties:

  • Smoother output
  • Good for dense features
  • Used in later layers (global average pooling)

Global Average Pooling (GAP)

Averages entire feature map to single value:

Input: 7×7×512 feature maps
           ↓
GAP: Average each 7×7 map to 1 value
           ↓
Output: 1×1×512

# Then flatten to [512] for classification

Advantages:

  • No learned parameters
  • Reduces overfitting
  • Enables variable input sizes
  • Used in modern architectures (ResNet, EfficientNet)
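The variable-input-size advantage follows from `nn.AdaptiveAvgPool2d` always producing the requested output grid; a quick sketch:

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)

# Different spatial sizes, same channel count -> same output shape
for h, w in [(7, 7), (14, 14), (9, 13)]:
    x = torch.randn(1, 512, h, w)
    print(gap(x).shape)  # torch.Size([1, 512, 1, 1]) every time
```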

Implementation

PyTorch

import torch
import torch.nn as nn

# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 32, 32)  # [batch, channels, height, width]
out = max_pool(x)  # [1, 64, 16, 16]

# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out = avg_pool(x)  # [1, 64, 16, 16]

# Global Average Pooling
gap = nn.AdaptiveAvgPool2d(1)  # Output size 1×1
out = gap(x)  # [1, 64, 1, 1]
out = out.view(out.size(0), -1)  # [1, 64] - flatten

TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    
    tf.keras.layers.Conv2D(256, 3, activation='relu'),
    GlobalAveragePooling2D(),  # Replace flatten + dense
    
    tf.keras.layers.Dense(10, activation='softmax')
])

Pooling Parameters

Kernel Size

2×2 (most common): Halves spatial dimensions
3×3 with stride 3: Divides spatial dimensions by 3

Stride

Stride = kernel_size (non-overlapping): Standard
Stride < kernel_size (overlapping): More information preserved

Example with 3×3 kernel:
Stride 3: 9×9 → 3×3 (non-overlapping)
Stride 2: 9×9 → 4×4 (overlapping)
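The two stride settings can be verified by checking output shapes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 9, 9)

# Non-overlapping: 3x3 windows, stride 3
print(F.max_pool2d(x, kernel_size=3, stride=3).shape)  # torch.Size([1, 1, 3, 3])

# Overlapping: 3x3 windows, stride 2
print(F.max_pool2d(x, kernel_size=3, stride=2).shape)  # torch.Size([1, 1, 4, 4])
```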

Padding

# Same padding: output same size as input
nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# Valid padding (no padding): output smaller
nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

Output Size Calculation

Output = floor((Input + 2×Padding - Kernel) / Stride) + 1

Example:
Input: 32×32
Kernel: 2×2
Stride: 2
Padding: 0

Output = floor((32 + 0 - 2) / 2) + 1 = 16
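The formula is easy to wrap in a small helper (`pool_output_size` is a hypothetical name, not a library function):

```python
def pool_output_size(input_size, kernel, stride, padding=0):
    """Compute floor((input + 2*padding - kernel) / stride) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1

print(pool_output_size(32, kernel=2, stride=2))  # 16
print(pool_output_size(9, kernel=3, stride=2))   # 4  (the overlapping example above)
```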

Pooling vs Strided Convolution

| Aspect                 | Pooling           | Strided Conv        |
|------------------------|-------------------|---------------------|
| Parameters             | None              | Learned             |
| Flexibility            | Fixed operation   | Learns downsampling |
| Computation            | Very fast         | Slower              |
| Translation invariance | Strong (max pool) | Weaker              |

Modern trend: Strided convolutions sometimes replace pooling.

# Traditional
nn.Conv2d(64, 128, kernel_size=3, padding=1)
nn.MaxPool2d(2, 2)

# Modern alternative
nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # Learns to downsample
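Both routes above halve the spatial dimensions; a quick shape check (channel counts here are just the ones used in the snippet):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Traditional: conv keeps 32x32, pooling halves it
traditional = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.MaxPool2d(2, 2),
)

# Modern alternative: the conv itself downsamples
modern = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(traditional(x).shape)  # torch.Size([1, 128, 16, 16])
print(modern(x).shape)       # torch.Size([1, 128, 16, 16])
```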

Special Pooling Types

L2 Pooling

# Square root of sum of squares over each window
import torch
import torch.nn.functional as F

def l2_pool(x, kernel_size=2, stride=2):
    x_squared = x ** 2
    # avg_pool2d times the window area recovers the sum over each window
    summed = F.avg_pool2d(x_squared, kernel_size, stride) * (kernel_size ** 2)
    return torch.sqrt(summed)

Spatial Pyramid Pooling (SPP)

Input: Variable size
       ↓
┌───────────────────────────────────────┐
│ Pool to 4×4  Pool to 2×2  Pool to 1×1 │
└───────────────────────────────────────┘
       ↓
Concatenate: 16 + 4 + 1 = 21 features per channel
       ↓
Fixed-size output regardless of input size
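A minimal SPP sketch can be built from PyTorch's adaptive pooling layers (`spatial_pyramid_pool` is a hypothetical helper, not a library function):

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(x, levels=(4, 2, 1)):
    """Pool each feature map to fixed grids and concatenate.

    With the default levels this gives 16 + 4 + 1 = 21 features
    per channel, regardless of the input's spatial size.
    """
    batch = x.size(0)
    pooled = [nn.AdaptiveMaxPool2d(n)(x).view(batch, -1) for n in levels]
    return torch.cat(pooled, dim=1)

# Two different input sizes produce the same output size: 256 * 21 = 5376
for size in (32, 57):
    x = torch.randn(1, 256, size, size)
    print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 5376])
```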

Pooling in Different Architectures

LeNet / AlexNet

Conv → Pool → Conv → Pool → FC → FC
(LeNet used 2×2 average pooling; AlexNet used overlapping 3×3 max pooling with stride 2)

VGG

Conv → Conv → Pool → Conv → Conv → Pool → ...
(2×2 max pooling after conv blocks)

ResNet

Conv 7×7 stride 2 → Max Pool 3×3 stride 2 → ... → GAP → FC
(Initial downsampling, then GAP at end)

Modern CNNs

  • Often use strided convolutions instead of pooling
  • Global Average Pooling before the final classifier

Key Takeaways

  1. Pooling reduces spatial dimensions while preserving features
  2. Max pooling keeps strongest activations, most common choice
  3. Average pooling provides smoother features
  4. Global Average Pooling replaces fully connected layers
  5. Output size = floor((input + 2×padding - kernel) / stride) + 1
  6. Modern networks sometimes use strided convolutions instead