Attention Mechanism
Attention allows neural networks to dynamically focus on relevant parts of the input. It's the key innovation behind Transformers and modern language models.
The Problem: Fixed-Length Bottleneck
Traditional seq2seq models compress the entire input into a single fixed-length vector:
"The quick brown fox" → Encoder → [fixed vector] → Decoder → Translation
Problem: Long sequences can't be compressed without losing information.
The Solution: Attention
Let the decoder look back at all encoder states:
Decoder step: Which encoder states are most relevant?
↓ attention weights
Encoder: [h1] [h2] [h3] [h4]
0.1 0.3 0.5 0.1 ← weights sum to 1
↓
weighted sum = context vector
How It Works
1. Score Function
Compute relevance score between query (decoder state) and keys (encoder states):
score(q, k) = ...
Common scoring functions:
- Dot product: q · k
- Scaled dot product: (q · k) / √d
- Additive: v · tanh(W[q; k])
- Multiplicative: q · W · k
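The four scoring functions can be sketched in NumPy; the dimension and the parameters `W`, `v`, `Wm` below are random stand-ins for what would be learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q, k = rng.normal(size=d), rng.normal(size=d)

# Dot product: q · k
dot = q @ k

# Scaled dot product: (q · k) / √d
scaled = (q @ k) / np.sqrt(d)

# Additive (Bahdanau-style): v · tanh(W[q; k]); W, v stand in for learned weights
W = rng.normal(size=(d, 2 * d))
v = rng.normal(size=d)
additive = v @ np.tanh(W @ np.concatenate([q, k]))

# Multiplicative (bilinear): q · W · k; Wm stands in for a learned weight
Wm = rng.normal(size=(d, d))
multiplicative = q @ Wm @ k

# Each variant reduces a (query, key) pair to a single scalar relevance score
```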
2. Softmax
Convert scores to probabilities (attention weights):
α = softmax(scores)
3. Weighted Sum
Compute context vector:
context = Σ αᵢ × valueᵢ
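Putting the three steps together, a minimal NumPy sketch (the encoder states and decoder query are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
rng = np.random.default_rng(1)
keys = rng.normal(size=(3, d))   # encoder states h1..h3 act as keys
values = keys                    # ...and as values, in the classic setup
query = rng.normal(size=d)       # current decoder state

# 1. Score: relevance of each encoder state to the query
scores = keys @ query / np.sqrt(d)

# 2. Softmax: scores -> attention weights that sum to 1
alpha = softmax(scores)

# 3. Weighted sum: context vector fed to the decoder
context = alpha @ values
```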
Query, Key, Value
The modern formulation:
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Where:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
Analogy: Library search
- Query: Your search terms
- Key: Book titles/metadata
- Value: Book contents
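The Q/K/V formulation above translates almost directly into code. A NumPy sketch, with illustrative shapes (the `attention` helper name is ours, not a fixed API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ / √d) × V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out, w = attention(Q, K, V)   # out: one context vector per query
```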
Self-Attention
When Q, K, V all come from the same sequence:
"The cat sat on the mat"
For "cat":
Query: representation of "cat"
Keys: representations of all words
Values: representations of all words
Result: "cat" attends strongly to "sat", "mat"
This is what Transformers use!
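A minimal self-attention sketch: the six rows of `x` stand in for the six token embeddings, and the random projection matrices stand in for learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=(6, d))            # 6 token embeddings ("The cat sat on the mat")
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Q, K, V all derive from the same sequence x
Q, K, V = x @ Wq, x @ Wk, x @ Wv
weights = softmax(Q @ K.T / np.sqrt(d))  # (6, 6): every token attends to every token
out = weights @ V
```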
Multi-Head Attention
Run multiple attention operations in parallel:
head_1 = Attention(QW₁^Q, KW₁^K, VW₁^V)
head_2 = Attention(QW₂^Q, KW₂^K, VW₂^V)
...
head_h = Attention(QWₕ^Q, KWₕ^K, VWₕ^V)
output = Concat(head_1, ..., head_h) × W^O
Each head can learn different relationships:
- Head 1: Syntactic dependencies
- Head 2: Coreference
- Head 3: Semantic similarity
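The multi-head computation can be sketched as follows; the head count, dimensions, and random projection matrices are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(4)
n, d_model, h = 6, 16, 4
d_head = d_model // h                  # each head works in a smaller subspace
x = rng.normal(size=(n, d_model))

heads = []
for _ in range(h):
    # Per-head projections W^Q, W^K, W^V (learned in practice, random here)
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

Wo = rng.normal(size=(d_model, d_model))        # output projection W^O
output = np.concatenate(heads, axis=-1) @ Wo    # back to (n, d_model)
```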
Scaled Dot-Product Attention
Why scale by √d?
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Without scaling:
- For unit-variance q and k, the dot product q · k has variance d, so typical score magnitudes grow with dimension
- Softmax over large scores becomes very peaked (near one-hot)
- Gradients through the softmax vanish
Dividing by √d keeps the score variance near 1, regardless of dimension.
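A quick numerical check of the variance argument (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 512
# Dot products of two unit-variance random vectors: Var(q · k) = d
dots = np.array([rng.normal(size=d) @ rng.normal(size=d) for _ in range(2000)])

print(dots.var())                  # grows with d (around 512 here)
print((dots / np.sqrt(d)).var())   # stays near 1 after scaling by √d
```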
Types of Attention
Cross-Attention
Q from one sequence, K,V from another:
- Used in encoder-decoder models
- Translation: decoder attends to encoder
Self-Attention
Q, K, V from same sequence:
- Each position attends to all positions
- Captures relationships within sequence
Causal (Masked) Attention
Self-attention but can only attend to past:
- For autoregressive generation
- Position i can't see positions > i
Mask (added to the scores before softmax; 0 = attend, -∞ = blocked):
[0, -∞, -∞, -∞]
[0,  0, -∞, -∞]
[0,  0,  0, -∞]
[0,  0,  0,  0]
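In practice the mask is implemented by adding -∞ above the diagonal before the softmax, which zeroes out those weights. A small NumPy sketch with uniform scores for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))                       # uniform scores, just for illustration
mask = np.triu(np.full((n, n), -np.inf), k=1)   # -inf strictly above the diagonal
weights = softmax(scores + mask)

# Row i spreads probability only over positions 0..i:
# row 0 -> [1, 0, 0, 0], row 3 -> [0.25, 0.25, 0.25, 0.25]
```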
Attention Visualizations
Attention weights are interpretable! You can see what the model focuses on:
"The  cat  sat  on   the  mat"
 0.05 0.45 0.25 0.05 0.05 0.15  ← attention weights for one query position (sum to 1)
Useful for debugging and understanding.
Computational Complexity
Self-attention: O(n² × d)
- n = sequence length
- d = dimension
Problem for long sequences! Solutions:
- Sparse attention (Longformer, BigBird)
- Linear attention (Performer)
- Chunked attention (local windows)
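For a sense of scale, a back-of-the-envelope on the n × n weight matrix alone (float32 assumed):

```python
# Self-attention materializes an n x n weight matrix: O(n^2) memory,
# O(n^2 * d) time. This quadratic growth is what the variants above attack.
for n in (1_024, 8_192, 65_536):
    scores = n * n                    # one score per query-key pair
    mib = scores * 4 / 2**20          # float32 bytes -> MiB
    print(f"n={n:>6}: {mib:>9,.0f} MiB attention matrix")
```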
Beyond NLP
Attention works everywhere:
- Vision: Attend to image patches (ViT)
- Speech: Attend to audio frames
- Graphs: Attend to neighboring nodes
- Multimodal: Cross-attention between modalities
Key Takeaways
- Attention computes weighted combinations based on relevance
- Query-Key-Value formulation is standard
- Self-attention enables parallel processing of sequences
- Multi-head attention captures different relationships
- Scaled dot-product prevents gradient issues
- Transformers are built entirely on attention