Attention Mechanism
Attention allows neural networks to dynamically focus on relevant parts of the input. It's the key innovation behind Transformers and modern language models.
The Problem: Fixed-Length Bottleneck
Traditional seq2seq models compress the entire input into a single fixed-length vector:
"The quick brown fox" → Encoder → [fixed vector] → Decoder → Translation
Problem: Long sequences can't be compressed without losing information.
The Solution: Attention
Let the decoder look back at all encoder states:
Decoder step: Which encoder states are most relevant?
↓ attention weights
Encoder: [h1] [h2] [h3] [h4]
0.1 0.3 0.5 0.1 ← weights sum to 1
↓
weighted sum = context vector
How It Works
1. Score Function
Compute relevance score between query (decoder state) and keys (encoder states):
score(q, k) = ...
Common scoring functions:
- Dot product: q · k
- Scaled dot product: (q · k) / √d
- Additive: v · tanh(W[q; k])
- Multiplicative: q · W · k
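The four scoring functions can be sketched in NumPy; the dimension and the parameters `W`, `v`, `Wm` below are random stand-ins for what would be learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q, k = rng.normal(size=d), rng.normal(size=d)

# Dot product: q · k
dot = q @ k

# Scaled dot product: (q · k) / √d
scaled = (q @ k) / np.sqrt(d)

# Additive (Bahdanau-style): v · tanh(W[q; k]); W, v stand in for learned weights
W = rng.normal(size=(d, 2 * d))
v = rng.normal(size=d)
additive = v @ np.tanh(W @ np.concatenate([q, k]))

# Multiplicative (bilinear): q · W · k; Wm stands in for a learned weight
Wm = rng.normal(size=(d, d))
multiplicative = q @ Wm @ k

# Each variant reduces a (query, key) pair to a single scalar relevance score
```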
2. Softmax
Convert scores to probabilities (attention weights):
α = softmax(scores)
3. Weighted Sum
Compute context vector:
context = Σ αᵢ × valueᵢ
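Putting the three steps together, a minimal NumPy sketch (the encoder states and decoder query are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
rng = np.random.default_rng(1)
keys = rng.normal(size=(3, d))   # encoder states h1..h3 act as keys
values = keys                    # ...and as values, in the classic setup
query = rng.normal(size=d)       # current decoder state

# 1. Score: relevance of each encoder state to the query
scores = keys @ query / np.sqrt(d)

# 2. Softmax: scores -> attention weights that sum to 1
alpha = softmax(scores)

# 3. Weighted sum: context vector fed to the decoder
context = alpha @ values
```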
Query, Key, Value
The modern formulation:
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Where:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
Analogy: Library search
- Query: Your search terms
- Key: Book titles/metadata
- Value: Book contents
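The Q/K/V formulation above translates almost directly into code. A NumPy sketch, with illustrative shapes (the `attention` helper name is ours, not a fixed API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ / √d) × V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out, w = attention(Q, K, V)   # out: one context vector per query
```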
Self-Attention
When Q, K, V all come from the same sequence:
"The cat sat on the mat"
For "cat":
Query: representation of "cat"
Keys: representations of all words
Values: representations of all words
Result: "cat" attends strongly to "sat", "mat"
This is what Transformers use!
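A minimal self-attention sketch: the six rows of `x` stand in for the six token embeddings, and the random projection matrices stand in for learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=(6, d))            # 6 token embeddings ("The cat sat on the mat")
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Q, K, V all derive from the same sequence x
Q, K, V = x @ Wq, x @ Wk, x @ Wv
weights = softmax(Q @ K.T / np.sqrt(d))  # (6, 6): every token attends to every token
out = weights @ V
```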
Multi-Head Attention
Run multiple attention operations in parallel:
head_1 = Attention(QW₁^Q, KW₁^K, VW₁^V)
head_2 = Attention(QW₂^Q, KW₂^K, VW₂^V)
...
head_h = Attention(QWₕ^Q, KWₕ^K, VWₕ^V)
output = Concat(head_1, ..., head_h) × W^O
Each head can learn different relationships:
- Head 1: Syntactic dependencies
- Head 2: Coreference
- Head 3: Semantic similarity
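The multi-head computation can be sketched as follows; the head count, dimensions, and random projection matrices are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(4)
n, d_model, h = 6, 16, 4
d_head = d_model // h                  # each head works in a smaller subspace
x = rng.normal(size=(n, d_model))

heads = []
for _ in range(h):
    # Per-head projections W^Q, W^K, W^V (learned in practice, random here)
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

Wo = rng.normal(size=(d_model, d_model))        # output projection W^O
output = np.concatenate(heads, axis=-1) @ Wo    # back to (n, d_model)
```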
Scaled Dot-Product Attention
Why scale by √d?
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Without scaling:
- For unit-variance q and k, the dot product q · k has variance d, so typical score magnitudes grow with dimension
- Softmax over large scores becomes very peaked (near one-hot)
- Gradients through the softmax vanish
Dividing by √d keeps the score variance near 1, regardless of dimension.
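A quick numerical check of the variance argument (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 512
# Dot products of two unit-variance random vectors: Var(q · k) = d
dots = np.array([rng.normal(size=d) @ rng.normal(size=d) for _ in range(2000)])

print(dots.var())                  # grows with d (around 512 here)
print((dots / np.sqrt(d)).var())   # stays near 1 after scaling by √d
```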
Types of Attention
Cross-Attention
Q from one sequence, K,V from another:
- Used in encoder-decoder models
- Translation: decoder attends to encoder
Self-Attention
Q, K, V from same sequence:
- Each position attends to all positions
- Captures relationships within sequence
Causal (Masked) Attention
Self-attention but can only attend to past:
- For autoregressive generation
- Position i can't see positions > i
Mask (added to the scores before softmax; 0 = attend, -∞ = blocked):
[0, -∞, -∞, -∞]
[0,  0, -∞, -∞]
[0,  0,  0, -∞]
[0,  0,  0,  0]
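In practice the mask is implemented by adding -∞ above the diagonal before the softmax, which zeroes out those weights. A small NumPy sketch with uniform scores for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))                       # uniform scores, just for illustration
mask = np.triu(np.full((n, n), -np.inf), k=1)   # -inf strictly above the diagonal
weights = softmax(scores + mask)

# Row i spreads probability only over positions 0..i:
# row 0 -> [1, 0, 0, 0], row 3 -> [0.25, 0.25, 0.25, 0.25]
```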
Attention Visualizations
Attention weights are interpretable! You can see what the model focuses on:
"The  cat  sat  on   the  mat"
 0.05 0.45 0.25 0.05 0.05 0.15  ← attention weights for one query position (sum to 1)
Useful for debugging and understanding.
Computational Complexity
Self-attention: O(n² × d)
- n = sequence length
- d = dimension
Problem for long sequences! Solutions:
- Sparse attention (Longformer, BigBird)
- Linear attention (Performer)
- Chunked attention (local windows)
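For a sense of scale, a back-of-the-envelope on the n × n weight matrix alone (float32 assumed):

```python
# Self-attention materializes an n x n weight matrix: O(n^2) memory,
# O(n^2 * d) time. This quadratic growth is what the variants above attack.
for n in (1_024, 8_192, 65_536):
    scores = n * n                    # one score per query-key pair
    mib = scores * 4 / 2**20          # float32 bytes -> MiB
    print(f"n={n:>6}: {mib:>9,.0f} MiB attention matrix")
```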
Beyond NLP
Attention works everywhere:
- Vision: Attend to image patches (ViT)
- Speech: Attend to audio frames
- Graphs: Attend to neighboring nodes
- Multimodal: Cross-attention between modalities
Key Takeaways
- Attention computes weighted combinations based on relevance
- Query-Key-Value formulation is standard
- Self-attention enables parallel processing of sequences
- Multi-head attention captures different relationships
- Scaled dot-product prevents gradient issues
- Transformers are built entirely on attention