Transformer Architecture
The Transformer is the architecture that revolutionized NLP and now underpins models far beyond it. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it is the foundation of GPT, BERT, and virtually all modern language models.
The Big Idea
Replace recurrence (RNNs) entirely with attention:
- RNNs: Process sequentially, slow, limited parallelization
- Transformers: Process all positions in parallel, fast, scalable
Architecture Overview
Output
↑
[Linear + Softmax]
↑
┌──────────────┐
│ Decoder │ × N  ← also attends to encoder output via cross-attention
│ Block │
└──────────────┘
↑
┌──────────────┐
│ Encoder │ × N
│ Block │
└──────────────┘
↑
[Input Embedding + Positional Encoding]
↑
Input
Encoder Block
Each encoder block:
Input
↓
[Multi-Head Self-Attention]
↓ + Residual
[Layer Norm]
↓
[Feed-Forward Network]
↓ + Residual
[Layer Norm]
↓
Output
Self-Attention
Each position attends to all positions:
- Captures long-range dependencies
- Parallel computation
- See: Attention Mechanism concept
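The points above can be sketched as scaled dot-product self-attention in a few lines of NumPy (a toy illustration; the weight shapes and random inputs are assumptions, not any particular model's values):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.
    X: (seq_len, d_model); weight matrices project to d_k columns."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # project inputs
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                             # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Every position's output is computed from all positions at once, which is exactly what makes the computation parallel.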
Feed-Forward Network
Two linear layers with non-linearity:
FFN(x) = W₂ × ReLU(W₁x + b₁) + b₂
Applied to each position independently. Often dimension expands then contracts (e.g., 768 → 3072 → 768).
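A minimal sketch of this position-wise FFN (toy sizes here; the 768 → 3072 → 768 shapes above are the realistic ones):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, ReLU, contract.
    Applied identically and independently to every row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                  # toy sizes
x = rng.normal(size=(5, d_model))      # 5 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)    # (5, 8)
```

Because the same weights apply to each row separately, positions only mix in the attention sublayers, never in the FFN.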
Residual Connections
output = LayerNorm(x + SubLayer(x))
Residual connections let gradients flow directly to earlier layers, which is what makes very deep stacks trainable. (This post-norm arrangement follows the original paper; many modern models instead apply LayerNorm before each sublayer, i.e. pre-norm.)
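The formula can be sketched directly (unparameterized LayerNorm for brevity; real implementations add a learned scale and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper: LayerNorm(x + SubLayer(x))."""
    return layer_norm(x + fn(x))

x = np.arange(12, dtype=float).reshape(3, 4)
y = sublayer(x, lambda t: 0.1 * t)        # toy stand-in for attention/FFN
print(np.allclose(y.mean(axis=-1), 0))    # True: each position normalized
```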
Decoder Block
Similar to encoder, but with:
Input
↓
[Masked Self-Attention] ← Can only attend to past
↓ + Residual
[Layer Norm]
↓
[Cross-Attention to Encoder] ← New!
↓ + Residual
[Layer Norm]
↓
[Feed-Forward Network]
↓ + Residual
[Layer Norm]
↓
Output
Masked Self-Attention
Prevents looking at future tokens (for autoregressive generation).
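One common way to implement this (a sketch): add -inf to the upper triangle of the score matrix before the softmax, so the weight on every future position becomes exactly zero.

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))             # pretend raw attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                            # hide the future
weights = np.exp(scores)                          # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)    # softmax row by row
print(weights.round(2))
# row i spreads its attention uniformly over positions 0..i only
```

The same mask works during training: every position learns next-token prediction in parallel without peeking ahead.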
Cross-Attention
Queries from decoder, keys/values from encoder:
- Decoder can focus on relevant encoder outputs
- Used in translation, summarization
Positional Encoding
Self-attention is permutation-invariant: shuffling the input tokens just shuffles the outputs, so position information must be added explicitly!
Sinusoidal (Original)
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Different frequencies for different dimensions.
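The two formulas can be computed for all positions at once (a sketch following the definitions above):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal positional encoding table."""
    pos = np.arange(max_len)[:, None]              # position index
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

The table is added to the input embeddings; low dimensions oscillate quickly (fine position), high dimensions slowly (coarse position).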
Learned Positional Embeddings
Learn position embeddings like word embeddings. Common in modern models.
Rotary Position Embeddings (RoPE)
Used in LLaMA and GPT-NeoX. Encodes relative position by rotating query and key vectors, and extrapolates better to long sequences.
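A minimal sketch of the rotation at RoPE's core (the interleaved pair layout and base 10000 are assumptions matching common implementations; real models apply this to queries and keys inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle,
    so relative position shows up in subsequent Q·K dot products."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))                          # 6 positions, head dim 8
print(np.allclose(np.linalg.norm(rope(q), axis=-1),
                  np.linalg.norm(q, axis=-1)))       # True: rotations preserve norms
```

Unlike an additive encoding, nothing is added to the embeddings; position lives entirely in the rotation angles.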
Architecture Variants
Encoder-Only (BERT)
Input → Encoder × N → Output representations
Bidirectional. For classification, NER, etc.
Decoder-Only (GPT)
Input → Decoder × N → Next token prediction
Autoregressive. For generation.
Encoder-Decoder (T5, BART)
Input → Encoder × N → Decoder × N → Output
For translation, summarization.
Key Hyperparameters
| Parameter | GPT-2 | BERT-base | GPT-3 |
|---|---|---|---|
| Layers (N) | 12 | 12 | 96 |
| Heads | 12 | 12 | 96 |
| d_model | 768 | 768 | 12288 |
| d_ff | 3072 | 3072 | 49152 |
| Parameters | 117M | 110M | 175B |
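As a sanity check on the table, a rough parameter count can be estimated from these hyperparameters (a back-of-envelope sketch: biases, LayerNorms, and positional parameters are ignored, and a GPT-2-style vocabulary of 50257 tokens is assumed):

```python
def approx_params(n_layers, d_model, vocab_size=50257):
    """Back-of-envelope transformer parameter count.
    Assumes d_ff = 4 * d_model; ignores biases, norms, position embeddings."""
    attn = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = 2 * d_model * (4 * d_model)     # expand + contract matrices
    embed = vocab_size * d_model          # token embedding table
    return n_layers * (attn + ffn) + embed

print(f"GPT-2 small: ~{approx_params(12, 768) / 1e6:.0f}M")    # near the table's 117M
print(f"GPT-3:       ~{approx_params(96, 12288) / 1e9:.0f}B")  # near the table's 175B
```

Note that per-layer weights scale with d_model², which is why GPT-3's 16× larger d_model dominates its parameter count.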
Training Objectives
Masked Language Modeling (BERT)
"The [MASK] sat on the mat" → predict "cat"
Bidirectional, for understanding.
Causal Language Modeling (GPT)
"The cat sat" → "on", "the", "mat"
Left-to-right, for generation.
Seq2Seq (T5)
"translate: Hello" → "Hola"
Any task as text-to-text.
Why Transformers Won
- Parallelization: No sequential dependency
- Long-range dependencies: Direct attention to any position
- Scalability: More compute → better performance
- Transfer learning: Pretrain once, fine-tune everywhere
- Versatility: Same architecture for many tasks
Limitations
- Quadratic attention: O(n²) memory and compute
- Fixed context length: Can't handle arbitrarily long sequences
- Compute hungry: Expensive to train and run
- No explicit structure: Everything is learned
Solutions for Long Sequences
- Sparse attention: Attend to subset (Longformer)
- Linear attention: Approximate with linear complexity (Performer)
- Recurrence: Add limited recurrence (Transformer-XL)
- Retrieval: External memory (RAG)
Key Takeaways
- Transformers replace recurrence with attention entirely
- Self-attention allows parallel processing of sequences
- Positional encoding adds position information
- Encoder-only (BERT) vs decoder-only (GPT) vs encoder-decoder (T5)
- Performance improves steadily with model size, data, and compute (so far)
- Quadratic complexity limits sequence length