Intermediate · Deep Learning

Master CNNs - the neural network architecture designed for images, using convolutions to learn spatial hierarchies of features.

computer-vision · neural-networks · convolution · image-classification

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are specialized neural networks designed for processing grid-like data, especially images. They've revolutionized computer vision and remain the backbone of most image AI systems.

Why Not Fully Connected?

For a 224×224 color image:

  • Input size: 224 × 224 × 3 = 150,528 values
  • FC layer with 1000 neurons: 150 million parameters!

Problems:

  1. Too many parameters: Overfitting, slow training
  2. Ignores spatial structure: Adjacent pixels are related
  3. Not translation invariant: A cat in the corner and a cat in the center look like completely different inputs to the network
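To make the parameter blow-up concrete, the arithmetic can be checked directly (the 64-filter 3×3 conv layer below is an illustrative comparison, not from the text):

```python
# Weights of one fully connected layer on a 224x224 RGB image
inputs = 224 * 224 * 3        # 150,528 input values
neurons = 1000
fc_params = inputs * neurons  # ~150.5 million weights (biases excluded)

# For comparison: a single 3x3 conv layer with 64 filters
# (illustrative sizes) -- weights are shared across all positions.
conv_params = 3 * 3 * 3 * 64  # kernel_h * kernel_w * in_channels * filters

print(f"{fc_params:,}")    # 150,528,000
print(f"{conv_params:,}")  # 1,728
```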

The Convolution Operation

Slide a small filter (kernel) across the image:

Input           Filter       Output (Feature Map)
[1 2 3 4]
[5 6 7 8]   *   [1 0]    =   [7,  9, 11]
[9 0 1 2]       [0 1]        [5,  7,  9]

Calculation for the top-left entry: 1×1 + 2×0 + 5×0 + 6×1 = 7. Slide the filter one step right and repeat: 2×1 + 3×0 + 6×0 + 7×1 = 9, and so on across the image.

Each filter detects one type of feature (edge, texture, etc.).
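A minimal NumPy sketch of the sliding-window operation above (strictly speaking this is cross-correlation, which is what deep learning frameworks implement under the name "convolution"):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image with stride 1, no padding ('valid')."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise multiply the window by the kernel, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 0, 1, 2]])
kernel = np.array([[1, 0],
                   [0, 1]])

print(conv2d_valid(image, kernel))
# [[ 7.  9. 11.]
#  [ 5.  7.  9.]]
```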

Key Properties

1. Local Connectivity

Neurons only connect to a local region of input.

  • Exploits spatial locality
  • Dramatically fewer parameters

2. Weight Sharing

Same filter weights applied across entire image.

  • Detects feature anywhere in image
  • Translation equivariance
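Translation equivariance is easy to demonstrate with a 1D convolution: shifting the input shifts the output by the same amount, because the same weights are applied everywhere.

```python
import numpy as np

# A symmetric "bump detector" filter (illustrative values)
kernel = np.array([1, 2, 1])

x = np.array([0, 0, 1, 0, 0, 0], dtype=float)
y = np.convolve(x, kernel, mode="valid")
print(y)            # [1. 2. 1. 0.]

x_shifted = np.roll(x, 1)  # move the bump one step right
y_shifted = np.convolve(x_shifted, kernel, mode="valid")
print(y_shifted)    # [0. 1. 2. 1.]  -- same pattern, shifted by one
```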

3. Multiple Filters

Many filters → many feature maps.

  • Different filters detect different features
  • Channels in deeper layers

CNN Building Blocks

Convolutional Layer

Parameters:
- Number of filters (output channels)
- Kernel size (usually 3×3 or 5×5)
- Stride (step size, usually 1 or 2)
- Padding (usually 'same' or 'valid')

Output size:

H_out = ⌊(H_in + 2×padding - kernel_size) / stride⌋ + 1
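The formula can be wrapped in a small helper to check common configurations (the example shapes are illustrative):

```python
import math

def conv_output_size(h_in, kernel_size, stride=1, padding=0):
    """H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1"""
    return math.floor((h_in + 2 * padding - kernel_size) / stride) + 1

# 3x3 kernel, stride 1, padding 1 preserves size ('same' padding)
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
# Stride 2 halves the spatial dimension
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```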

Pooling Layer

Downsamples feature maps:

Max Pooling (most common):

[1 3]   Max    
[2 4]  ────→  [4]  (take max of 2×2 region)

Average Pooling: Take mean instead.

Global Average Pooling: Average each entire feature map down to a single value per channel.
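A compact NumPy sketch of non-overlapping 2×2 max pooling:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (assumes even dimensions)."""
    h, w = x.shape
    # Split rows and columns into 2x2 blocks, then take each block's max
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 0, 2],
              [2, 4, 1, 5],
              [7, 0, 3, 1],
              [1, 6, 2, 2]])
print(max_pool_2x2(x))
# [[4 5]
#  [7 3]]
```

Average pooling is the same reshape with `.mean(axis=(1, 3))` instead of `.max`.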

Activation (ReLU)

Applied element-wise after convolution.

Fully Connected Layers

At the end for classification:

Conv → Pool → Conv → Pool → Flatten → FC → Softmax

CNN Architecture Pattern

[Conv → ReLU] × N → Pool → [Conv → ReLU] × M → Pool → ... → FC → Output

As you go deeper:

  • Spatial dimensions ↓ (pooling)
  • Number of channels ↑ (more filters)
  • Features become more abstract

Famous Architectures

LeNet (1998)

The first widely successful CNN, used for handwritten digit recognition.

AlexNet (2012)

Started deep learning revolution. ImageNet winner.

  • Deep (8 layers)
  • ReLU activation
  • Dropout
  • GPU training

VGGNet (2014)

Simplicity: only 3×3 convolutions.

  • Deep (16-19 layers)
  • Shows depth matters

ResNet (2015)

Residual connections enable very deep networks:

output = F(x) + x  (skip connection)

  • Up to 152+ layers
  • Mitigates vanishing gradients
  • Still widely used
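A toy dense-layer sketch of the skip connection (real ResNet blocks use convolutions plus batch normalization; shapes and initialization here are illustrative). Near initialization, F(x) ≈ 0, so the block behaves almost like the identity, which is part of why very deep stacks stay trainable:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """output = F(x) + x, where F is a small two-layer transform."""
    return relu(w2 @ relu(w1 @ x)) + x  # skip connection adds input back

d = 4
x = rng.normal(size=d)
w1 = rng.normal(size=(d, d)) * 0.01  # small weights: F(x) starts near zero
w2 = rng.normal(size=(d, d)) * 0.01

out = residual_block(x, w1, w2)
# With near-zero weights the block starts close to the identity map:
print(np.allclose(out, x, atol=1e-3))  # True
```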

Modern Architectures

  • EfficientNet: Balanced scaling
  • ConvNeXt: A modernized CNN competitive with Transformers
  • Vision Transformers (ViT): Pure attention (not CNN)

Receptive Field

The region of the input that affects a neuron. With stacked 3×3 filters at stride 1:

Layer 1 neuron: sees 3×3
Layer 2 neuron: sees 5×5 (3×3 of layer 1 outputs)
Layer 3 neuron: sees 7×7
...

Deeper → larger receptive field → more global features.
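The growth pattern above (each additional 3×3, stride-1 layer adds 2 pixels to the receptive field) can be sketched as:

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of stacked stride-1 conv layers, no pooling:
    each extra layer grows the field by kernel_size - 1."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

for n in (1, 2, 3):
    print(n, receptive_field(n))  # 1 3 / 2 5 / 3 7
```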

1×1 Convolutions

Surprisingly useful:

  • Change number of channels
  • Add non-linearity
  • Reduce parameters
  • Used in Inception, ResNet bottlenecks
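A 1×1 convolution is just a per-position linear map across channels; a NumPy sketch (the channel counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(64, 8, 8))  # (channels, height, width)
weights = rng.normal(size=(16, 64))        # one 1x1 filter per output channel

# Mix channels at every (h, w) location; spatial dims are untouched.
out = np.einsum('oc,chw->ohw', weights, feature_map)
print(out.shape)  # (16, 8, 8): 64 channels reduced to 16
```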

Transfer Learning

CNNs pretrained on ImageNet transfer to other tasks:

  1. Feature extraction: Use pretrained CNN as fixed feature extractor
  2. Fine-tuning: Train pretrained CNN on new task with small learning rate

Works because early layers learn generic features (edges, textures).

Beyond Images

CNNs work on any grid-like data:

  • 1D convolutions: Sequences, time series, audio
  • 3D convolutions: Video, medical imaging
  • Graph convolutions: Molecular data, social networks

Key Takeaways

  1. CNNs use convolutions for spatial feature extraction
  2. Local connectivity + weight sharing = fewer parameters
  3. Deeper layers capture more abstract features
  4. Residual connections enable very deep networks
  5. Pretrained CNNs are great feature extractors
  6. Vision Transformers are emerging as an alternative