Convolutional Neural Networks (CNNs)
Convolutional Neural Networks are specialized neural networks designed for processing grid-like data, especially images. They've revolutionized computer vision and remain the backbone of most image AI systems.
Why Not Fully Connected?
For a 224×224 color image:
- Input size: 224 × 224 × 3 = 150,528 values
- FC layer with 1000 neurons: 150,528 × 1000 ≈ 150 million parameters!
Problems:
- Too many parameters: Overfitting, slow training
- Ignores spatial structure: Adjacent pixels are related
- Not translation invariant: A cat in the corner vs. the center activates entirely different weights
The Convolution Operation
Slide a small filter (kernel) across the image:
Input            Filter        Output (Feature Map)
[1 2 3 4]
[5 6 7 8]   *    [1 0]    =    [7  9 11]
[9 0 1 2]        [0 1]         [5  7  9]
Calculation (top-left window): 1×1 + 2×0 + 5×0 + 6×1 = 7
Each filter detects one type of feature (edge, texture, etc.).
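The sliding-window computation above can be sketched in a few lines of plain Python. Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep-learning libraries call "convolution"; `conv2d` here is an illustrative helper, not a library function.

```python
def conv2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image
    and take the element-wise product-sum at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 0, 1, 2]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d(image, kernel))  # [[7, 9, 11], [5, 7, 9]]
```

The 3×4 input and 2×2 kernel produce a 2×3 feature map, matching the worked example.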
Key Properties
1. Local Connectivity
Neurons only connect to a local region of input.
- Exploits spatial locality
- Dramatically fewer parameters
2. Weight Sharing
Same filter weights applied across entire image.
- Detects feature anywhere in image
- Translation equivariance
3. Multiple Filters
Many filters → many feature maps.
- Different filters detect different features
- Channels in deeper layers
CNN Building Blocks
Convolutional Layer
Parameters:
- Number of filters (output channels)
- Kernel size (usually 3×3 or 5×5)
- Stride (step size, usually 1 or 2)
- Padding (usually 'same' or 'valid')
Output size:
H_out = ⌊(H_in + 2×padding - kernel_size) / stride⌋ + 1
(the floor discards any final window that doesn't fully fit)
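The output-size formula can be checked directly; floor division handles strides that don't divide the input evenly:

```python
def conv_output_size(h_in, kernel_size, stride=1, padding=0):
    # Floor division drops the last partial window, matching the formula.
    return (h_in + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224 ('same')
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(224, kernel_size=5, stride=1, padding=0))  # 220 ('valid')
```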
Pooling Layer
Downsamples feature maps:
Max Pooling (most common):
[1 3] Max
[2 4] ────→ [4] (take max of 2×2 region)
Average Pooling: Take mean instead.
Global Average Pooling: Average entire feature map to single value.
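Both pooling variants are easy to sketch in plain Python; `max_pool2d` and `global_avg_pool` below are illustrative helpers, not library functions:

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max over each size×size window (non-overlapping when stride == size)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]

def global_avg_pool(fmap):
    """Collapse an entire feature map to its mean value."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

fmap = [[1, 3],
        [2, 4]]
print(max_pool2d(fmap))      # [[4]]  -- matches the diagram above
print(global_avg_pool(fmap)) # 2.5
```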
Activation (ReLU)
Applied element-wise after convolution.
Fully Connected Layers
At the end for classification:
Conv → Pool → Conv → Pool → Flatten → FC → Softmax
CNN Architecture Pattern
[Conv → ReLU] × N → Pool → [Conv → ReLU] × M → Pool → ... → FC → Output
As you go deeper:
- Spatial dimensions ↓ (pooling)
- Number of channels ↑ (more filters)
- Features become more abstract
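The shrinking-spatial / growing-channel trend can be traced numerically. The layer stack below is illustrative (a VGG-style pattern with 3×3 'same' convs and 2×2 pooling), not any specific published network:

```python
def out_size(h, k, s, p):
    return (h + 2 * p - k) // s + 1

h, c = 224, 3  # input: 224×224 RGB
# (layer, kernel, stride, padding, out_channels) -- hypothetical stack
stack = [("conv", 3, 1, 1, 64),  ("pool", 2, 2, 0, 64),
         ("conv", 3, 1, 1, 128), ("pool", 2, 2, 0, 128),
         ("conv", 3, 1, 1, 256), ("pool", 2, 2, 0, 256)]
for name, k, s, p, c_out in stack:
    h, c = out_size(h, k, s, p), c_out
    print(f"{name}: {h}x{h}x{c}")
# spatial size halves at each pool (224 -> 112 -> 56 -> 28)
# while channels grow (3 -> 64 -> 128 -> 256)
```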
Famous Architectures
LeNet (1998)
First widely successful CNN, applied to handwritten digit recognition.
AlexNet (2012)
Started deep learning revolution. ImageNet winner.
- Deep (8 layers)
- ReLU activation
- Dropout
- GPU training
VGGNet (2014)
Simplicity: only 3×3 convolutions.
- Deep (16-19 layers)
- Shows depth matters
ResNet (2015)
Residual connections enable very deep networks:
output = F(x) + x (skip connection)
- Up to 152 layers (ResNet-152), with even deeper experimental variants
- Solves vanishing gradient
- Still widely used
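The skip connection is just an addition. In the scalar sketch below, `f` stands in for the block's conv layers (the weights are toy values, not a real network):

```python
def relu(x):
    return max(0.0, x)

def residual_block(x, f):
    # output = F(x) + x: the layers only need to learn the residual,
    # and the identity path gives gradients an unobstructed route back.
    return relu(f(x) + x)

# Toy residual function standing in for two conv layers
f = lambda x: 0.5 * x - 1.0
print(residual_block(4.0, f))  # relu(0.5*4 - 1 + 4) = 5.0
```

Even if `f` outputs nothing useful (all zeros), the block still passes `x` through unchanged, which is why stacking many such blocks doesn't degrade the signal.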
Modern Architectures
- EfficientNet: Balanced scaling
- ConvNeXt: CNN matching Transformers
- Vision Transformers (ViT): Pure attention (not CNN)
Receptive Field
The region of input that affects a neuron:
Layer 1 neuron: sees 3×3
Layer 2 neuron: sees 5×5 (3×3 of layer 1 outputs)
Layer 3 neuron: sees 7×7
...
Deeper → larger receptive field → more global features.
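The growth above can be computed for an arbitrary stack. `receptive_field` is a hypothetical helper; the key idea is that each layer's kernel expands the field by (k − 1) times the current "jump" (the distance, in input pixels, between adjacent outputs), and strides enlarge that jump:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns the receptive field of one neuron after the whole stack."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))      # 3 -- one 3x3 layer
print(receptive_field([(3, 1)] * 2))  # 5 -- matches the table above
print(receptive_field([(3, 1)] * 3))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8 -- pooling grows it faster
```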
1×1 Convolutions
Surprisingly useful:
- Change number of channels
- Add non-linearity
- Reduce parameters
- Used in Inception, ResNet bottlenecks
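A 1×1 convolution is just a linear map applied independently at every pixel, mixing channels without looking at neighbors. The sketch below (illustrative helper, hypothetical weights) reduces 3 input channels to 1:

```python
def conv1x1(fmap, weights):
    """1x1 convolution as a per-pixel linear map across channels.
    fmap: [C_in][H][W] nested lists; weights: [C_out][C_in]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weights))]

# 3 channels at a single pixel, combined with weights [1, 0, 2]
fmap = [[[3.0]], [[6.0]], [[9.0]]]
print(conv1x1(fmap, [[1.0, 0.0, 2.0]]))  # [[[21.0]]] -- 3*1 + 6*0 + 9*2
```

This is exactly how bottleneck blocks shrink channel count (say 256 → 64) before an expensive 3×3 conv, then expand back.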
Transfer Learning
CNNs pretrained on ImageNet transfer to other tasks:
- Feature extraction: Use pretrained CNN as fixed feature extractor
- Fine-tuning: Train pretrained CNN on new task with small learning rate
Works because early layers learn generic features (edges, textures).
Beyond Images
CNNs work on any grid-like data:
- 1D convolutions: Sequences, time series, audio
- 3D convolutions: Video, medical imaging
- Graph convolutions: Molecular data, social networks
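The 1D case is the same sliding-window idea on a sequence. A minimal sketch (again cross-correlation, with an illustrative edge-detecting kernel):

```python
def conv1d(signal, kernel):
    """'Valid' 1D cross-correlation over a sequence."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A [-1, 1] kernel fires on upward steps and negatively on downward steps
print(conv1d([0, 0, 1, 1, 0], [-1, 1]))  # [0, 1, 0, -1]
```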
Key Takeaways
- CNNs use convolutions for spatial feature extraction
- Local connectivity + weight sharing = fewer parameters
- Deeper layers capture more abstract features
- Residual connections enable very deep networks
- Pretrained CNNs are great feature extractors
- Vision Transformers are emerging as alternative