Positional Encoding
Positional encoding adds sequence position information to transformer models. Since self-attention treats inputs as an unordered set, position must be explicitly encoded.
The Problem
Self-Attention is Permutation-Invariant
Attention("The cat sat") = Attention("sat cat The")
(permuting the input tokens just permutes the outputs)
Without position info, the model can't distinguish order!
Position Matters
"Dog bites man" ≠ "Man bites dog"
Sinusoidal Positional Encoding
Original Transformer (Vaswani et al.)
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
pos: Position in sequence (0, 1, 2, ...)
i: Dimension index
d: Model dimension
Visualization
Position 0: [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
Position 1: [sin(1), cos(1), sin(1/ω), cos(1/ω), ...]
Position 2: [sin(2), cos(2), sin(2/ω), cos(2/ω), ...]
Each dimension pair oscillates at its own frequency (here ω = 10000^(2/d)): low dimensions vary quickly with position, high dimensions vary slowly.
Implementation
```python
import torch
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    # frequencies 1 / 10000^(2i/d), one per even dimension index
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe
```
Why Sinusoids?
- Deterministic: No learned parameters
- Extrapolation: Can encode positions beyond the training length (though quality often degrades in practice)
- Relative positions: PE(pos+k) can be expressed as linear function of PE(pos)
- Bounded values: Sin/cos always in [-1, 1]
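The last two properties can be checked numerically. A quick sketch (the `sinusoidal_encoding` definition from the implementation above is repeated so the snippet is self-contained):

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

d_model, k = 16, 5
pe = sinusoidal_encoding(64, d_model)

# Bounded values: everything lies in [-1, 1]
assert pe.abs().max().item() <= 1.0

# Relative positions: each (sin, cos) pair at pos+k is a rotation of the
# pair at pos by angle k*freq -- the rotation depends on k, not on pos
freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                  * -(math.log(10000.0) / d_model))
for pos in (0, 10, 37):
    s, c = pe[pos, 0::2], pe[pos, 1::2]
    s_k = s * torch.cos(k * freqs) + c * torch.sin(k * freqs)
    c_k = c * torch.cos(k * freqs) - s * torch.sin(k * freqs)
    assert torch.allclose(s_k, pe[pos + k, 0::2], atol=1e-5)
    assert torch.allclose(c_k, pe[pos + k, 1::2], atol=1e-5)
```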
Learned Positional Embeddings
Approach
Learn position embeddings like word embeddings:
```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        # one trainable d_model-vector per position, like a word embedding
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); fails if seq_len > max_len
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.embedding(positions)
```
Used In
- BERT
- GPT-1, GPT-2
Pros/Cons
✓ Can learn optimal positions for task
✓ Simple to implement
✗ Fixed maximum length
✗ Can't extrapolate to longer sequences
✗ More parameters
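The fixed-length limitation is concrete: positions past `max_len` simply have no row in the embedding table. A small demonstration (repeating the module above so the snippet runs on its own):

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.embedding(positions)

enc = LearnedPositionalEncoding(max_len=8, d_model=16)
out = enc(torch.zeros(1, 8, 16))   # within max_len: works
failed = False
try:
    enc(torch.zeros(1, 9, 16))     # one position past max_len
except IndexError:
    failed = True                  # the lookup table has no row for position 8
```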
Rotary Position Embeddings (RoPE)
Used In
- LLaMA
- GPT-NeoX
- PaLM
Key Idea
Rotate query and key vectors based on position:
q_rotated = R(θ_pos) × q
k_rotated = R(θ_pos) × k
where R(θ_pos) is a (block-diagonal) rotation matrix whose angles depend on the position
Properties
- Position info in attention scores, not embeddings
- Relative position aware: q·k depends on pos_q - pos_k
- Better extrapolation to long sequences
Simplified Implementation
```python
def rotate_half(x):
    # swap the two halves of the last dimension, negating the second half
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # rotate each dimension pair of x by its position-dependent angle
    return (x * cos) + (rotate_half(x) * sin)
```
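`apply_rope` also needs the cos/sin tables indexed by position. A self-contained sketch (repeating the two helpers, and assuming the half-split dimension pairing used by GPT-NeoX-style implementations) that also checks the relative-position property:

```python
import torch

def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

def rope_tables(seq_len, head_dim, base=10000.0):
    # one rotation angle per dimension pair, geometric frequency schedule
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    angles = torch.cat([angles, angles], dim=-1)  # match rotate_half's layout
    return angles.cos(), angles.sin()

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)
cos, sin = rope_tables(seq_len=32, head_dim=8)

# the score depends only on the offset pos_q - pos_k (= 3 in both cases)
s1 = apply_rope(q, cos[10], sin[10]) @ apply_rope(k, cos[7], sin[7])
s2 = apply_rope(q, cos[20], sin[20]) @ apply_rope(k, cos[17], sin[17])
```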
ALiBi (Attention with Linear Biases)
Used In
- BLOOM
- MPT
Key Idea
No positional encoding in embeddings. Instead, add bias to attention scores:
Attention(Q, K) = softmax(QK^T / √d - m × distance)
distance[i,j] = |i - j|
m = per-head slope, drawn from a fixed geometric sequence (not learned)
Properties
- Zero-shot extrapolation
- Very simple
- Works well in practice
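A minimal sketch of the bias computation. The slope schedule 2^(-8(h+1)/n) for head h of n follows the ALiBi paper's recipe for n a power of two; `alibi_bias` is a name chosen here, not a library function:

```python
import torch

def alibi_bias(n_heads, seq_len):
    # fixed geometric slopes: 2^(-8/n), 2^(-16/n), ... (not learned)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads)
                           for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()   # distance[i, j] = |i - j|
    return -slopes[:, None, None] * distance         # (heads, seq, seq), all <= 0

bias = alibi_bias(n_heads=8, seq_len=16)
# then, per head: scores = Q @ K.transpose(-2, -1) / sqrt(d) + bias
```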
Relative Position Encodings
Transformer-XL Style
Attention includes relative position:
A_ij = q_i^T k_j + q_i^T r_{i-j} + u^T k_j + v^T r_{i-j}
r_{i-j}: Relative position embedding
u, v: Learned biases
T5 Relative Position Bias
Learned scalar bias per relative position:
A_ij = q_i^T k_j + b_{i-j}
b_{i-j}: Learned bias for distance (i-j)
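A minimal sketch of such a bias table. T5 itself maps distances into log-spaced buckets; plain clipping at a maximum distance is assumed here for simplicity:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, n_heads, max_distance=16):
        super().__init__()
        self.max_distance = max_distance
        # one learned scalar per (clipped relative distance, head)
        self.table = nn.Embedding(2 * max_distance + 1, n_heads)

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]            # rel[i, j] = j - i
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.table(rel).permute(2, 0, 1)      # (heads, seq, seq)

bias = RelativePositionBias(n_heads=4)(seq_len=10)
# added to attention logits: A_ij = q_i^T k_j + bias[head, i, j]
```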
Comparison
| Method | Extrapolation | Parameters | Used In |
|---|---|---|---|
| Sinusoidal | In principle | 0 | Original Transformer |
| Learned | No | O(L×d) | BERT, GPT-2 |
| RoPE | Good | 0 | LLaMA, GPT-NeoX |
| ALiBi | Excellent | O(heads) | BLOOM, MPT |
| T5 Relative | Limited | O(buckets) | T5 |
How Position Encoding is Applied
Additive (Most Common)
```python
embeddings = word_embeddings + positional_encoding
```
Concatenative (Less Common)
```python
embeddings = torch.cat([word_embeddings, positional_encoding], dim=-1)
```
In Attention (RoPE, ALiBi)
```python
# RoPE: rotate Q and K before computing scores
Q = apply_rope(Q, cos, sin)
K = apply_rope(K, cos, sin)

# ALiBi: bias the attention scores by token distance
scores = Q @ K.transpose(-2, -1) - alibi_bias
```
Choosing a Method
| Scenario | Recommendation |
|---|---|
| Fixed short sequences | Learned |
| Variable/long sequences | RoPE or ALiBi |
| Extrapolation needed | ALiBi |
| Relative positions important | T5-style or RoPE |
| Simple baseline | Sinusoidal |
Key Takeaways
- Transformers need position info (attention is order-agnostic)
- Sinusoidal: deterministic, extrapolates, no parameters
- Learned: flexible, but can't extrapolate
- RoPE: rotates Q/K, good extrapolation, widely used
- ALiBi: simple bias, excellent extrapolation
- Choice depends on sequence lengths and extrapolation needs