Intermediate · Natural Language Processing

Learn about attention - the mechanism that allows neural networks to focus on relevant parts of input, revolutionizing NLP and beyond.

Tags: attention, transformers, deep-learning, sequence-modeling

Attention Mechanism

Attention allows neural networks to dynamically focus on relevant parts of the input. It's the key innovation behind Transformers and modern language models.

The Problem: Fixed-Length Bottleneck

Traditional seq2seq models compress the entire input into a single fixed-length vector:

"The quick brown fox" → Encoder → [fixed vector] → Decoder → Translation

Problem: long sequences can't be compressed this way without losing information.

The Solution: Attention

Let the decoder look back at all encoder states:

Decoder step:  Which encoder states are most relevant?
               ↓ attention weights
Encoder:  [h1]   [h2]   [h3]   [h4]
          0.1    0.3    0.5    0.1   ← weights sum to 1
               ↓
          weighted sum = context vector

How It Works

1. Score Function

Compute relevance score between query (decoder state) and keys (encoder states):

score(q, k) = ...

Common scoring functions:

  • Dot product: q · k
  • Scaled dot product: (q · k) / √d
  • Additive: v · tanh(W[q; k])
  • Multiplicative: q · W · k

2. Softmax

Convert scores to probabilities (attention weights):

α = softmax(scores)

3. Weighted Sum

Compute context vector:

context = Σ αᵢ × valueᵢ
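The three steps above can be sketched in NumPy. All numbers here are toy values chosen for illustration, not from any real model:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy encoder states (4 positions, dimension 3) and one decoder query.
keys = values = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0],
                          [1.0, 1.0, 0.0]])
query = np.array([1.0, 1.0, 0.0])

scores = keys @ query / np.sqrt(query.shape[0])  # 1. scaled dot-product scores
weights = softmax(scores)                        # 2. attention weights, sum to 1
context = weights @ values                       # 3. weighted sum of the values
```

The fourth key matches the query best, so it receives the largest weight and dominates the context vector.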

Query, Key, Value

The modern formulation:

Attention(Q, K, V) = softmax(QKᵀ / √d) × V

Where:

  • Query (Q): What am I looking for?
  • Key (K): What do I contain?
  • Value (V): What information do I provide?

Analogy: Library search

  • Query: Your search terms
  • Key: Book titles/metadata
  • Value: Book contents
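The formula above fits in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention, with random matrices standing in for learned projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries, dimension 4
K = rng.normal(size=(5, 4))  # 5 keys
V = rng.normal(size=(5, 4))  # 5 values
out = attention(Q, K, V)     # one context vector per query: shape (2, 4)
```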

Self-Attention

When Q, K, V all come from the same sequence:

"The cat sat on the mat"

For "cat":
Query: representation of "cat"
Keys: representations of all words
Values: representations of all words

Result: "cat" attends strongly to "sat", "mat"

This is what Transformers use!
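Self-attention is just the same operation with Q, K, and V all derived from one sequence. A minimal sketch (in real Transformers, X is first multiplied by learned projection matrices rather than used directly):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # 6 tokens ("The cat sat on the mat"), dim 8
out = attention(X, X, X)     # Q, K, V all come from the same sequence
```

Each of the 6 tokens gets an updated vector that mixes in information from every other token, so `out` has the same shape as `X`.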

Multi-Head Attention

Run multiple attention operations in parallel:

head_1 = Attention(QW₁ᵠ, KW₁ᵏ, VW₁ᵛ)
head_2 = Attention(QW₂ᵠ, KW₂ᵏ, VW₂ᵛ)
...
head_h = Attention(QWₕᵠ, KWₕᵏ, VWₕᵛ)

output = Concat(head_1, ..., head_h) × Wᵒ
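The equations above can be sketched directly: project Q, K, V per head, attend, concatenate, and apply the output projection. The sizes and random weights below are illustrative only:

```python
import numpy as np

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists of per-head projections; Wo: output projection."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        q, k, v = Q @ wq, K @ wk, V @ wv          # per-head projections
        d = q.shape[-1]
        s = q @ k.T / np.sqrt(d)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ v)                        # head_i
    return np.concatenate(heads, axis=-1) @ Wo    # Concat(...) × Wᵒ

rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 16, 4, 4
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
Wo = rng.normal(size=(h * d_head, d_model))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)  # shape (5, 16)
```

Note the common convention d_head = d_model / h, so the total computation stays comparable to a single full-width head.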

Each head can learn different relationships:

  • Head 1: Syntactic dependencies
  • Head 2: Coreference
  • Head 3: Semantic similarity

Scaled Dot-Product Attention

Why scale by √d?

Attention(Q, K, V) = softmax(QKᵀ / √d) × V

Without scaling:

  • Dot products grow with dimension d
  • Softmax becomes very peaked (near one-hot)
  • Gradients vanish

Dividing by √d keeps the variance of the scores near 1, so the softmax stays well-behaved.
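A quick NumPy check of this effect: dot products of random unit-variance vectors have standard deviation ≈ √d, and the scaling brings it back to ≈ 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
q = rng.normal(size=(10_000, d))  # 10,000 random query/key pairs
k = rng.normal(size=(10_000, d))

raw = (q * k).sum(axis=1)         # unscaled dot products
scaled = raw / np.sqrt(d)         # scaled dot products

# raw.std() is roughly √512 ≈ 22.6; scaled.std() is roughly 1
```

With scores spread over a range of ±20+, softmax saturates to near one-hot and gradients vanish; after scaling the scores stay in a range softmax handles gracefully.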

Types of Attention

Cross-Attention

Q from one sequence, K,V from another:

  • Used in encoder-decoder models
  • Translation: decoder attends to encoder

Self-Attention

Q, K, V from same sequence:

  • Each position attends to all positions
  • Captures relationships within sequence

Causal (Masked) Attention

Self-attention in which each position can only attend to the past:

  • For autoregressive generation
  • Position i can't see positions > i
Additive mask (added to the scores before softmax; 0 = keep, -∞ = block):

       [ 0, -∞, -∞, -∞]
       [ 0,  0, -∞, -∞]
       [ 0,  0,  0, -∞]
       [ 0,  0,  0,  0]
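Masking is a one-line change to plain self-attention: set the scores above the diagonal to -∞ before the softmax, so future positions get exactly zero weight. A minimal sketch:

```python
import numpy as np

def causal_self_attention(X):
    """Self-attention where position i attends only to positions ≤ i."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # -∞ → zero weight after softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w, w @ X

rng = np.random.default_rng(0)
w, out = causal_self_attention(rng.normal(size=(4, 8)))
# w is lower-triangular: no position attends to the future
```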

Attention Visualizations

Attention weights are interpretable! You can see what the model focuses on:

"The cat sat on the mat"
     ↓     ↓         ↓
   0.3   0.5       0.15  ← attention for predicting "it"

Useful for debugging and understanding.

Computational Complexity

Self-attention: O(n² × d)

  • n = sequence length
  • d = dimension
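A back-of-envelope check of the n² term: the attention-weight matrix alone is n × n floats, which grows fast with sequence length (4 bytes per float32 assumed):

```python
def attn_matrix_gb(n, bytes_per_float=4):
    """Memory for one n×n float32 attention matrix, in GB."""
    return n * n * bytes_per_float / 1e9

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {attn_matrix_gb(n):g} GB")
# 1,000 tokens: 0.004 GB; 10,000 tokens: 0.4 GB; 100,000 tokens: 40 GB
```

And that is per head, per layer, so quadratic attention becomes impractical long before 100k tokens without one of the tricks below.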

Problem for long sequences! Solutions:

  • Sparse attention (Longformer, BigBird)
  • Linear attention (Performer)
  • Chunked attention (local windows)

Beyond NLP

Attention works everywhere:

  • Vision: Attend to image patches (ViT)
  • Speech: Attend to audio frames
  • Graphs: Attend to neighboring nodes
  • Multimodal: Cross-attention between modalities

Key Takeaways

  1. Attention computes weighted combinations based on relevance
  2. Query-Key-Value formulation is standard
  3. Self-attention enables parallel processing of sequences
  4. Multi-head attention captures different relationships
  5. Scaled dot-product prevents gradient issues
  6. Transformers are built entirely on attention