Transformer Architecture
The Transformer is the architecture that revolutionized NLP and now underpins models far beyond it. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it is the foundation of GPT, BERT, and virtually all modern language models.
The Big Idea
Replace recurrence (RNNs) entirely with attention:
- RNNs: Process sequentially, slow, limited parallelization
- Transformers: Process all positions in parallel, fast, scalable
Architecture Overview
Output
↑
[Linear + Softmax]
↑
┌──────────────┐
│ Decoder │ × N  ← also attends to encoder output via cross-attention
│ Block │
└──────────────┘
↑
┌──────────────┐
│ Encoder │ × N
│ Block │
└──────────────┘
↑
[Input Embedding + Positional Encoding]
↑
Input
Encoder Block
Each encoder block:
Input
↓
[Multi-Head Self-Attention]
↓ + Residual
[Layer Norm]
↓
[Feed-Forward Network]
↓ + Residual
[Layer Norm]
↓
Output
Self-Attention
Each position attends to all positions:
- Captures long-range dependencies
- Parallel computation
- See: Attention Mechanism concept
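The points above can be sketched as scaled dot-product self-attention in a few lines of NumPy (a toy illustration; the weight shapes and random inputs are assumptions, not any particular model's values):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.
    X: (seq_len, d_model); weight matrices project to d_k columns."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # project inputs
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                             # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Every position's output is computed from all positions at once, which is exactly what makes the computation parallel.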
Feed-Forward Network
Two linear layers with non-linearity:
FFN(x) = W₂ × ReLU(W₁x + b₁) + b₂
Applied to each position independently. Often dimension expands then contracts (e.g., 768 → 3072 → 768).
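A minimal sketch of this position-wise FFN (toy sizes here; the 768 → 3072 → 768 shapes above are the realistic ones):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, ReLU, contract.
    Applied identically and independently to every row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                  # toy sizes
x = rng.normal(size=(5, d_model))      # 5 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)    # (5, 8)
```

Because the same weights apply to each row separately, positions only mix in the attention sublayers, never in the FFN.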
Residual Connections
output = LayerNorm(x + SubLayer(x))
Residual connections let gradients flow directly to earlier layers, which is what makes very deep stacks trainable. (This post-norm arrangement follows the original paper; many modern models instead apply LayerNorm before each sublayer, i.e. pre-norm.)
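The formula can be sketched directly (unparameterized LayerNorm for brevity; real implementations add a learned scale and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper: LayerNorm(x + SubLayer(x))."""
    return layer_norm(x + fn(x))

x = np.arange(12, dtype=float).reshape(3, 4)
y = sublayer(x, lambda t: 0.1 * t)        # toy stand-in for attention/FFN
print(np.allclose(y.mean(axis=-1), 0))    # True: each position normalized
```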
Decoder Block
Similar to encoder, but with:
Input
↓
[Masked Self-Attention] ← Can only attend to past
↓ + Residual
[Layer Norm]
↓
[Cross-Attention to Encoder] ← New!
↓ + Residual
[Layer Norm]
↓
[Feed-Forward Network]
↓ + Residual
[Layer Norm]
↓
Output
Masked Self-Attention
Prevents looking at future tokens (for autoregressive generation).
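One common way to implement this (a sketch): add -inf to the upper triangle of the score matrix before the softmax, so the weight on every future position becomes exactly zero.

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))             # pretend raw attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                            # hide the future
weights = np.exp(scores)                          # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)    # softmax row by row
print(weights.round(2))
# row i spreads its attention uniformly over positions 0..i only
```

The same mask works during training: every position learns next-token prediction in parallel without peeking ahead.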
Cross-Attention
Queries from decoder, keys/values from encoder:
- Decoder can focus on relevant encoder outputs
- Used in translation, summarization
Positional Encoding
Self-attention is permutation-invariant: shuffling the input tokens just shuffles the outputs, so position information must be added explicitly!
Sinusoidal (Original)
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Different frequencies for different dimensions.
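The two formulas can be computed for all positions at once (a sketch following the definitions above):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal positional encoding table."""
    pos = np.arange(max_len)[:, None]              # position index
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

The table is added to the input embeddings; low dimensions oscillate quickly (fine position), high dimensions slowly (coarse position).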
Learned Positional Embeddings
Learn position embeddings like word embeddings. Common in modern models.
Rotary Position Embeddings (RoPE)
Used in LLaMA and GPT-NeoX. Encodes relative position by rotating query and key vectors, and extrapolates better to long sequences.
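A minimal sketch of the rotation at RoPE's core (the interleaved pair layout and base 10000 are assumptions matching common implementations; real models apply this to queries and keys inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle,
    so relative position shows up in subsequent Q·K dot products."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))                          # 6 positions, head dim 8
print(np.allclose(np.linalg.norm(rope(q), axis=-1),
                  np.linalg.norm(q, axis=-1)))       # True: rotations preserve norms
```

Unlike an additive encoding, nothing is added to the embeddings; position lives entirely in the rotation angles.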
Architecture Variants
Encoder-Only (BERT)
Input → Encoder × N → Output representations
Bidirectional. For classification, NER, etc.
Decoder-Only (GPT)
Input → Decoder × N → Next token prediction
Autoregressive. For generation.
Encoder-Decoder (T5, BART)
Input → Encoder × N → Decoder × N → Output
For translation, summarization.
Key Hyperparameters
| Parameter | GPT-2 | BERT-base | GPT-3 |
|---|---|---|---|
| Layers (N) | 12 | 12 | 96 |
| Heads | 12 | 12 | 96 |
| d_model | 768 | 768 | 12288 |
| d_ff | 3072 | 3072 | 49152 |
| Parameters | 117M | 110M | 175B |
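As a sanity check on the table, a rough parameter count can be estimated from these hyperparameters (a back-of-envelope sketch: biases, LayerNorms, and positional parameters are ignored, and a GPT-2-style vocabulary of 50257 tokens is assumed):

```python
def approx_params(n_layers, d_model, vocab_size=50257):
    """Back-of-envelope transformer parameter count.
    Assumes d_ff = 4 * d_model; ignores biases, norms, position embeddings."""
    attn = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = 2 * d_model * (4 * d_model)     # expand + contract matrices
    embed = vocab_size * d_model          # token embedding table
    return n_layers * (attn + ffn) + embed

print(f"GPT-2 small: ~{approx_params(12, 768) / 1e6:.0f}M")    # near the table's 117M
print(f"GPT-3:       ~{approx_params(96, 12288) / 1e9:.0f}B")  # near the table's 175B
```

Note that per-layer weights scale with d_model², which is why GPT-3's 16× larger d_model dominates its parameter count.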
Training Objectives
Masked Language Modeling (BERT)
"The [MASK] sat on the mat" → predict "cat"
Bidirectional, for understanding.
Causal Language Modeling (GPT)
"The cat sat" → "on", "the", "mat"
Left-to-right, for generation.
Seq2Seq (T5)
"translate: Hello" → "Hola"
Any task as text-to-text.
Why Transformers Won
- Parallelization: No sequential dependency
- Long-range dependencies: Direct attention to any position
- Scalability: More compute → better performance
- Transfer learning: Pretrain once, fine-tune everywhere
- Versatility: Same architecture for many tasks
Limitations
- Quadratic attention: O(n²) memory and compute
- Fixed context length: Can't handle arbitrarily long sequences
- Compute hungry: Expensive to train and run
- No explicit structure: Everything is learned
Solutions for Long Sequences
- Sparse attention: Attend to subset (Longformer)
- Linear attention: Approximate with linear complexity (Performer)
- Recurrence: Add limited recurrence (Transformer-XL)
- Retrieval: External memory (RAG)
Key Takeaways
- Transformers replace recurrence with attention entirely
- Self-attention allows parallel processing of sequences
- Positional encoding adds position information
- Encoder-only (BERT) vs decoder-only (GPT) vs encoder-decoder (T5)
- Performance improves steadily with model size, data, and compute (so far)
- Quadratic complexity limits sequence length