Intermediate · Natural Language Processing

Master the Transformer - the revolutionary architecture behind BERT, GPT, and all modern language models, built entirely on attention.

Tags: transformers, attention, bert, gpt, deep-learning

Transformer Architecture

The Transformer is the architecture that revolutionized NLP and is now conquering all of AI. Introduced in "Attention Is All You Need" (2017), it's the foundation of GPT, BERT, and virtually all modern language models.

The Big Idea

Replace recurrence (RNNs) entirely with attention:

  • RNNs: Process sequentially, slow, limited parallelization
  • Transformers: Process all positions in parallel, fast, scalable

Architecture Overview

         Output
           ↑
    [Linear + Softmax]
           ↑
    ┌──────────────┐
    │   Decoder    │ × N
    │   Block      │
    └──────────────┘
           ↑
    ┌──────────────┐
    │   Encoder    │ × N
    │   Block      │
    └──────────────┘
           ↑
    [Input Embedding + Positional Encoding]
           ↑
        Input

(Simplified: in the full model the decoder also receives the shifted target sequence as its own input, and the encoder's output reaches it through cross-attention.)

Encoder Block

Each encoder block:

Input
  ↓
[Multi-Head Self-Attention]
  ↓ + Residual
[Layer Norm]
  ↓
[Feed-Forward Network]
  ↓ + Residual
[Layer Norm]
  ↓
Output

Self-Attention

Each position attends to all positions:

  • Captures long-range dependencies
  • Parallel computation
  • See: Attention Mechanism concept
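The whole computation is short enough to sketch in NumPy. This is a single head with illustrative identity projections standing in for the learned matrices Wq, Wk, Wv; real models learn these and use multiple heads:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every position attends to every position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len) similarities
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, d_model = 8
Wq = Wk = Wv = np.eye(8)                          # identity projections for the sketch
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                  # one output vector per position
```

Note that nothing in the computation is sequential: all five positions are processed in one matrix multiply, which is exactly what enables parallelization.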

Feed-Forward Network

Two linear layers with non-linearity:

FFN(x) = W₂ × ReLU(W₁x + b₁) + b₂

Applied to each position independently. The inner dimension typically expands (often by 4×) and then contracts (e.g., 768 → 3072 → 768).
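The formula above maps directly to two matrix multiplies. A minimal NumPy sketch (random weights, small dimensions chosen for illustration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, apply ReLU, contract."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32                          # stands in for e.g. 768 -> 3072
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.normal(size=(5, d_model))              # 5 positions
y = ffn(x, W1, b1, W2, b2)
print(y.shape)                                 # same shape in and out
```

Because the same weights apply to every row of x, each position is transformed independently of the others — only attention mixes information across positions.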

Residual Connections

output = LayerNorm(x + SubLayer(x))

Allow gradients to flow directly through the skip path, which is what makes very deep stacks trainable.
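The post-LN pattern from the formula above can be sketched in a few lines of NumPy (learned scale and shift parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-LN residual wrapper, as in the original Transformer."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(5, 8))
out = residual_block(x, lambda h: 0.1 * h)     # toy stand-in for attention/FFN
print(out.shape)
```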

Decoder Block

Similar to encoder, but with:

Input
  ↓
[Masked Self-Attention]  ← Can only attend to past
  ↓ + Residual
[Layer Norm]
  ↓
[Cross-Attention to Encoder]  ← New!
  ↓ + Residual
[Layer Norm]
  ↓
[Feed-Forward Network]
  ↓ + Residual
[Layer Norm]
  ↓
Output

Masked Self-Attention

Prevents looking at future tokens (for autoregressive generation).
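The mask is implemented by adding -∞ to attention scores above the diagonal before the softmax, which zeroes out the corresponding weights. A small NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf strictly above the diagonal: position i attends only to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With uniform scores, masking yields a uniform distribution over the visible past
w = softmax(np.zeros((4, 4)) + causal_mask(4))
print(np.round(w, 2))
```

Row i of w has exactly i+1 nonzero entries: each position spreads its attention only over itself and earlier positions.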

Cross-Attention

Queries from decoder, keys/values from encoder:

  • Decoder can focus on relevant encoder outputs
  • Used in translation, summarization

Positional Encoding

Self-attention has no built-in notion of token order (permuting the inputs just permutes the outputs) - position information must be added explicitly!

Sinusoidal (Original)

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Different frequencies for different dimensions.
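The two formulas above vectorize directly. A NumPy sketch (note the index array below already holds the even values 2i, so the exponent i/d_model matches 2i/d in the formula):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position encodings from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)                                   # (50, 16): added to token embeddings
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; higher dimensions oscillate with ever-lower frequencies.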

Learned Positional Embeddings

Learn position embeddings like word embeddings. Common in modern models.

Rotary Position Embeddings (RoPE)

Encodes position by rotating query and key vectors through position-dependent angles, so attention scores depend on relative position. Used in LLaMA and GPT-NeoX; extrapolates better to long sequences.

Architecture Variants

Encoder-Only (BERT)

Input → Encoder × N → Output representations

Bidirectional. For classification, NER, etc.

Decoder-Only (GPT)

Input → Decoder × N → Next token prediction

Autoregressive. For generation.

Encoder-Decoder (T5, BART)

Input → Encoder × N → Decoder × N → Output

For translation, summarization.

Key Hyperparameters

Parameter   | GPT-2 | BERT-base | GPT-3
Layers (N)  | 12    | 12        | 96
Heads       | 12    | 12        | 96
d_model     | 768   | 768       | 12288
d_ff        | 3072  | 3072      | 49152
Parameters  | 117M  | 110M      | 175B
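These sizes are related: a rough per-layer weight count follows from the block structure (four d_model × d_model attention projections plus two FFN matrices; biases, LayerNorm, and embeddings ignored in this back-of-the-envelope sketch):

```python
def transformer_layer_params(d_model, d_ff):
    """Rough per-layer weight count: Wq, Wk, Wv, output projection + two FFN layers."""
    attention = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    return attention + ffn

# BERT-base: 12 layers with d_model=768, d_ff=3072
per_layer = transformer_layer_params(768, 3072)
print(12 * per_layer)    # ~85M of BERT-base's 110M; embeddings account for most of the rest
```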

Training Objectives

Masked Language Modeling (BERT)

"The [MASK] sat on the mat" → predict "cat"

Bidirectional, for understanding.

Causal Language Modeling (GPT)

"The cat sat" → "on", "the", "mat"

Left-to-right, for generation.
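The training signal can be made concrete: every prefix of the sequence predicts the token that follows it. A tiny sketch for the sentence above:

```python
# Next-token prediction pairs: each prefix (context) predicts the following token
tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(" ".join(context), "->", target)
```

With causal masking, all of these predictions are computed in a single parallel forward pass over the sequence.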

Seq2Seq (T5)

"translate: Hello" → "Hola"

Any task as text-to-text.

Why Transformers Won

  1. Parallelization: No sequential dependency
  2. Long-range dependencies: Direct attention to any position
  3. Scalability: More compute → better performance
  4. Transfer learning: Pretrain once, fine-tune everywhere
  5. Versatility: Same architecture for many tasks

Limitations

  1. Quadratic attention: O(n²) memory and compute
  2. Fixed context length: Can't handle arbitrarily long sequences
  3. Compute hungry: Expensive to train and run
  4. No explicit structure: Everything is learned

Solutions for Long Sequences

  • Sparse attention: Attend to subset (Longformer)
  • Linear attention: Approximate with linear complexity (Performer)
  • Recurrence: Add limited recurrence (Transformer-XL)
  • Retrieval: External memory (RAG)

Key Takeaways

  1. Transformers replace recurrence with attention entirely
  2. Self-attention allows parallel processing of sequences
  3. Positional encoding adds position information
  4. Encoder-only (BERT) vs decoder-only (GPT) vs encoder-decoder (T5)
  5. Scale = performance (so far)
  6. Quadratic complexity limits sequence length
