Learn about positional encoding - how transformers represent sequence order since self-attention is position-agnostic.

Positional Encoding

Positional encoding adds sequence position information to transformer models. Since self-attention treats inputs as an unordered set, position must be explicitly encoded.

The Problem

Self-Attention is Permutation-Invariant

Attention("The cat sat") = Attention("sat cat The")

Attention weights depend only on token content, so permuting the input merely permutes the output. Without position info, the model can't distinguish order!

Position Matters

"Dog bites man" ≠ "Man bites dog"

Sinusoidal Positional Encoding

Original Transformer (Vaswani et al.)

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

pos: Position in sequence (0, 1, 2, ...)
i: Dimension index
d: Model dimension

Visualization

Position 0: [sin(0), cos(0), sin(0),   cos(0),   ...]
Position 1: [sin(1), cos(1), sin(1/ω), cos(1/ω), ...]
Position 2: [sin(2), cos(2), sin(2/ω), cos(2/ω), ...]

Each pair of dimensions oscillates at its own frequency (here ω = 10000^(2/d)): early dimensions vary quickly with position, later ones slowly.

Implementation

import torch
import math

def sinusoidal_encoding(max_len, d_model):
    # Precompute the full (max_len, d_model) position table
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
    # 1 / 10000^(2i/d) for each even dimension 2i, computed via exp/log
    div_term = torch.exp(torch.arange(0, d_model, 2) *
                         -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)           # even dims get sin
    pe[:, 1::2] = torch.cos(position * div_term)           # odd dims get cos
    return pe
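
The table is computed once and added to the token embeddings. A minimal usage sketch (shapes are illustrative):

pe = sinusoidal_encoding(max_len=512, d_model=64)
x = torch.randn(2, 10, 64)        # stand-in token embeddings (batch, seq, d)
x = x + pe[:x.size(1)]            # add position info for the first 10 positions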

Why Sinusoids?

  1. Deterministic: No learned parameters
  2. Extrapolation: Can handle longer sequences than training
  3. Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) (see the identity after this list)
  4. Bounded values: Sin/cos always in [-1, 1]
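
Property 3 follows from the angle-addition identities. For each frequency ω in the encoding:

sin(ω(pos + k)) =  cos(ωk) · sin(ω·pos) + sin(ωk) · cos(ω·pos)
cos(ω(pos + k)) = -sin(ωk) · sin(ω·pos) + cos(ωk) · cos(ω·pos)

so PE(pos + k) is a fixed rotation of PE(pos) that depends only on the offset k.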

Learned Positional Embeddings

Approach

Learn position embeddings like word embeddings:

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        # A trainable vector per position, just like a word embedding table
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); position embeddings broadcast over batch
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.embedding(positions)

Used In

  • BERT
  • GPT-1, GPT-2

Pros/Cons

✓ Can learn optimal positions for task
✓ Simple to implement

✗ Fixed maximum length
✗ Can't extrapolate to longer sequences
✗ More parameters

Rotary Position Embeddings (RoPE)

Used In

  • LLaMA
  • GPT-NeoX
  • PaLM

Key Idea

Rotate query and key vectors by a position-dependent angle before computing attention:

q_rotated = R(θ_pos) × q
k_rotated = R(θ_pos) × k

where R(θ_pos) is a block-diagonal rotation matrix: each pair of dimensions is rotated in its own 2D plane by an angle proportional to the position, with a different frequency per pair (as in the sinusoidal scheme).

Properties

  • Position info in attention scores, not embeddings
  • Relative position aware: q·k depends only on pos_q - pos_k (see the identity after this list)
  • Better extrapolation to long sequences
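
The relative-position property follows from how rotations compose:

(R(θ·m) q) · (R(θ·n) k) = q^T R(θ·m)^T R(θ·n) k = q^T R(θ·(n - m)) k

so the attention score between positions m and n depends only on the offset n - m.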

Simplified Implementation

def rotate_half(x):
    # Split the last dim in half and map (x1, x2) -> (-x2, x1):
    # a 90° rotation within each (i, i + d/2) pair of dimensions
    x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # Elementwise: x·cos(θ) plus the 90°-rotated x scaled by sin(θ)
    return (x * cos) + (rotate_half(x) * sin)
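
The cos and sin tables are precomputed from the positions and the per-pair frequencies. A minimal sketch matching the half-split layout of rotate_half above (build_rope_cache is an illustrative name, not a standard API):

import torch

def build_rope_cache(seq_len, dim, base=10000.0):
    # One frequency per dimension pair, as in the sinusoidal encoding
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, dim/2)
    angles = torch.cat([angles, angles], dim=-1)                   # (seq_len, dim)
    return angles.cos(), angles.sin()

cos, sin = build_rope_cache(seq_len=128, dim=64)
q = apply_rope(torch.randn(128, 64), cos, sin)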

ALiBi (Attention with Linear Biases)

Used In

  • BLOOM
  • MPT

Key Idea

No positional encoding in embeddings. Instead, add bias to attention scores:

Attention(Q, K) = softmax(QK^T / √d - m × distance)

distance[i,j] = |i - j|   (the original causal formulation uses i - j for j ≤ i)
m = slope (a fixed geometric sequence, one slope per head; not learned)
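
A minimal sketch of the bias under the symmetric |i - j| form above (alibi_bias is an illustrative name):

import torch

def alibi_bias(seq_len, num_heads):
    # Fixed slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()   # |i - j|
    return slopes[:, None, None] * distance          # (heads, seq, seq); subtracted from scores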

Properties

  • Zero-shot extrapolation
  • Very simple
  • Works well in practice

Relative Position Encodings

Transformer-XL Style

Attention includes relative position:
A_ij = q_i^T k_j + q_i^T r_{i-j} + u^T k_j + v^T r_{i-j}

r_{i-j}: Relative position embedding
u, v: Learned biases

T5 Relative Position Bias

Learned scalar bias per relative position:

A_ij = q_i^T k_j + b_{i-j}

b_{i-j}: Learned bias for distance (i-j)
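
A simplified sketch of such a table (RelativeBias is an illustrative name; real T5 maps distances into log-spaced buckets rather than clipping them):

import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    def __init__(self, num_heads, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        # One learned scalar per head per clipped relative distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        return self.bias(rel + self.max_distance).permute(2, 0, 1)  # (heads, q_len, k_len)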

Comparison

Method        Extrapolation   Parameters           Used In
Sinusoidal    In principle    0                    Original Transformer
Learned       No              O(L × d)             BERT, GPT-2
RoPE          Good            0                    LLaMA, GPT-NeoX
ALiBi         Excellent       0 (fixed slopes)     BLOOM, MPT
T5 Relative   Limited         O(buckets × heads)   T5

How Position Encoding is Applied

Additive (Most Common)

embeddings = word_embeddings + positional_encoding

Concatenative (Less Common)

embeddings = torch.cat([word_embeddings, positional_encoding], dim=-1)

In Attention (RoPE, ALiBi)

# RoPE: rotate Q and K before computing scores
Q = apply_rope(Q, cos, sin)
K = apply_rope(K, cos, sin)

# ALiBi: bias the attention scores directly
scores = Q @ K.transpose(-2, -1) / math.sqrt(d) - alibi_bias

Choosing a Method

Scenario                        Recommendation
Fixed short sequences           Learned
Variable/long sequences         RoPE or ALiBi
Extrapolation needed            ALiBi
Relative positions important    T5-style or RoPE
Simple baseline                 Sinusoidal

Key Takeaways

  1. Transformers need position info (attention is order-agnostic)
  2. Sinusoidal: deterministic, extrapolates, no parameters
  3. Learned: flexible, but can't extrapolate
  4. RoPE: rotates Q/K, good extrapolation, widely used
  5. ALiBi: simple bias, excellent extrapolation
  6. Choice depends on sequence lengths and extrapolation needs
