Convolutional Neural Networks (CNNs)
Convolutional Neural Networks are specialized neural networks designed for processing grid-like data, especially images. They've revolutionized computer vision and remain the backbone of most image AI systems.
Why Not Fully Connected?
For a 224×224 color image:
- Input size: 224 × 224 × 3 = 150,528 values
- FC layer with 1000 neurons: 150,528 × 1000 ≈ 150 million parameters!
Problems:
- Too many parameters: Overfitting, slow training
- Ignores spatial structure: Adjacent pixels are related
- Not translation invariant: A cat in the corner vs. the center activates entirely different weights
The Convolution Operation
Slide a small filter (kernel) across the image:
Input            Filter        Output (Feature Map)
[1 2 3 4]
[5 6 7 8]   *    [1 0]    =    [7  9 11]
[9 0 1 2]        [0 1]         [5  7  9]
Calculation (top-left window): 1×1 + 2×0 + 5×0 + 6×1 = 7
Each filter detects one type of feature (edge, texture, etc.).
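The sliding-window computation above can be sketched in a few lines of plain Python. Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep-learning libraries call "convolution"; `conv2d` here is an illustrative helper, not a library function.

```python
def conv2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image
    and take the element-wise product-sum at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 0, 1, 2]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d(image, kernel))  # [[7, 9, 11], [5, 7, 9]]
```

The 3×4 input and 2×2 kernel produce a 2×3 feature map, matching the worked example.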
Key Properties
1. Local Connectivity
Neurons only connect to a local region of input.
- Exploits spatial locality
- Dramatically fewer parameters
2. Weight Sharing
Same filter weights applied across entire image.
- Detects feature anywhere in image
- Translation equivariance
3. Multiple Filters
Many filters → many feature maps.
- Different filters detect different features
- Channels in deeper layers
CNN Building Blocks
Convolutional Layer
Parameters:
- Number of filters (output channels)
- Kernel size (usually 3×3 or 5×5)
- Stride (step size, usually 1 or 2)
- Padding (usually 'same' or 'valid')
Output size:
H_out = ⌊(H_in + 2×padding - kernel_size) / stride⌋ + 1
(the floor discards any final window that doesn't fully fit)
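The output-size formula can be checked directly; floor division handles strides that don't divide the input evenly:

```python
def conv_output_size(h_in, kernel_size, stride=1, padding=0):
    # Floor division drops the last partial window, matching the formula.
    return (h_in + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224 ('same')
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(224, kernel_size=5, stride=1, padding=0))  # 220 ('valid')
```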
Pooling Layer
Downsamples feature maps:
Max Pooling (most common):
[1 3] Max
[2 4] ────→ [4] (take max of 2×2 region)
Average Pooling: Take mean instead.
Global Average Pooling: Average entire feature map to single value.
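Both pooling variants are easy to sketch in plain Python; `max_pool2d` and `global_avg_pool` below are illustrative helpers, not library functions:

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max over each size×size window (non-overlapping when stride == size)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]

def global_avg_pool(fmap):
    """Collapse an entire feature map to its mean value."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

fmap = [[1, 3],
        [2, 4]]
print(max_pool2d(fmap))      # [[4]]  -- matches the diagram above
print(global_avg_pool(fmap)) # 2.5
```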
Activation (ReLU)
Applied element-wise after convolution.
Fully Connected Layers
At the end for classification:
Conv → Pool → Conv → Pool → Flatten → FC → Softmax
CNN Architecture Pattern
[Conv → ReLU] × N → Pool → [Conv → ReLU] × M → Pool → ... → FC → Output
As you go deeper:
- Spatial dimensions ↓ (pooling)
- Number of channels ↑ (more filters)
- Features become more abstract
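The shrinking-spatial / growing-channel trend can be traced numerically. The layer stack below is illustrative (a VGG-style pattern with 3×3 'same' convs and 2×2 pooling), not any specific published network:

```python
def out_size(h, k, s, p):
    return (h + 2 * p - k) // s + 1

h, c = 224, 3  # input: 224×224 RGB
# (layer, kernel, stride, padding, out_channels) -- hypothetical stack
stack = [("conv", 3, 1, 1, 64),  ("pool", 2, 2, 0, 64),
         ("conv", 3, 1, 1, 128), ("pool", 2, 2, 0, 128),
         ("conv", 3, 1, 1, 256), ("pool", 2, 2, 0, 256)]
for name, k, s, p, c_out in stack:
    h, c = out_size(h, k, s, p), c_out
    print(f"{name}: {h}x{h}x{c}")
# spatial size halves at each pool (224 -> 112 -> 56 -> 28)
# while channels grow (3 -> 64 -> 128 -> 256)
```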
Famous Architectures
LeNet (1998)
First widely successful CNN, applied to handwritten digit recognition.
AlexNet (2012)
Started deep learning revolution. ImageNet winner.
- Deep (8 layers)
- ReLU activation
- Dropout
- GPU training
VGGNet (2014)
Simplicity: only 3×3 convolutions.
- Deep (16-19 layers)
- Shows depth matters
ResNet (2015)
Residual connections enable very deep networks:
output = F(x) + x (skip connection)
- Up to 152 layers (ResNet-152), with even deeper experimental variants
- Solves vanishing gradient
- Still widely used
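The skip connection is just an addition. In the scalar sketch below, `f` stands in for the block's conv layers (the weights are toy values, not a real network):

```python
def relu(x):
    return max(0.0, x)

def residual_block(x, f):
    # output = F(x) + x: the layers only need to learn the residual,
    # and the identity path gives gradients an unobstructed route back.
    return relu(f(x) + x)

# Toy residual function standing in for two conv layers
f = lambda x: 0.5 * x - 1.0
print(residual_block(4.0, f))  # relu(0.5*4 - 1 + 4) = 5.0
```

Even if `f` outputs nothing useful (all zeros), the block still passes `x` through unchanged, which is why stacking many such blocks doesn't degrade the signal.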
Modern Architectures
- EfficientNet: Balanced scaling
- ConvNeXt: CNN matching Transformers
- Vision Transformers (ViT): Pure attention (not CNN)
Receptive Field
The region of input that affects a neuron:
Layer 1 neuron: sees 3×3
Layer 2 neuron: sees 5×5 (3×3 of layer 1 outputs)
Layer 3 neuron: sees 7×7
...
Deeper → larger receptive field → more global features.
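The growth above can be computed for an arbitrary stack. `receptive_field` is a hypothetical helper; the key idea is that each layer's kernel expands the field by (k − 1) times the current "jump" (the distance, in input pixels, between adjacent outputs), and strides enlarge that jump:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns the receptive field of one neuron after the whole stack."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))      # 3 -- one 3x3 layer
print(receptive_field([(3, 1)] * 2))  # 5 -- matches the table above
print(receptive_field([(3, 1)] * 3))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8 -- pooling grows it faster
```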
1×1 Convolutions
Surprisingly useful:
- Change number of channels
- Add non-linearity
- Reduce parameters
- Used in Inception, ResNet bottlenecks
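A 1×1 convolution is just a linear map applied independently at every pixel, mixing channels without looking at neighbors. The sketch below (illustrative helper, hypothetical weights) reduces 3 input channels to 1:

```python
def conv1x1(fmap, weights):
    """1x1 convolution as a per-pixel linear map across channels.
    fmap: [C_in][H][W] nested lists; weights: [C_out][C_in]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weights))]

# 3 channels at a single pixel, combined with weights [1, 0, 2]
fmap = [[[3.0]], [[6.0]], [[9.0]]]
print(conv1x1(fmap, [[1.0, 0.0, 2.0]]))  # [[[21.0]]] -- 3*1 + 6*0 + 9*2
```

This is exactly how bottleneck blocks shrink channel count (say 256 → 64) before an expensive 3×3 conv, then expand back.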
Transfer Learning
CNNs pretrained on ImageNet transfer to other tasks:
- Feature extraction: Use pretrained CNN as fixed feature extractor
- Fine-tuning: Train pretrained CNN on new task with small learning rate
Works because early layers learn generic features (edges, textures).
Beyond Images
CNNs work on any grid-like data:
- 1D convolutions: Sequences, time series, audio
- 3D convolutions: Video, medical imaging
- Graph convolutions: Molecular data, social networks
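The 1D case is the same sliding-window idea on a sequence. A minimal sketch (again cross-correlation, with an illustrative edge-detecting kernel):

```python
def conv1d(signal, kernel):
    """'Valid' 1D cross-correlation over a sequence."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A [-1, 1] kernel fires on upward steps and negatively on downward steps
print(conv1d([0, 0, 1, 1, 0], [-1, 1]))  # [0, 1, 0, -1]
```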
Key Takeaways
- CNNs use convolutions for spatial feature extraction
- Local connectivity + weight sharing = fewer parameters
- Deeper layers capture more abstract features
- Residual connections enable very deep networks
- Pretrained CNNs are great feature extractors
- Vision Transformers are emerging as alternative