Tokenization
Tokenization is the process of breaking text into smaller units (tokens) that can be processed by models. It's a crucial preprocessing step that significantly impacts model performance.
Why Tokenize?
Neural networks need numerical inputs. Tokenization supplies them in three steps:
"Hello, world!" → [15496, 11, 995, 0] → embeddings → model
- Convert text into discrete units (tokens)
- Map each unit to an integer index
- Look up the embedding vector for each index
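The three steps above can be sketched end to end. The vocabulary, ids, and 4-dimensional embeddings below are invented for illustration; real models use learned tables with tens of thousands of rows:

```python
# A toy version of the text -> tokens -> ids -> vectors pipeline.
# The vocabulary, ids, and embedding sizes here are made up for illustration.
import random
import re

vocab = {"Hello": 0, ",": 1, "world": 2, "!": 3}
embedding_table = {i: [random.random() for _ in range(4)] for i in vocab.values()}

def tokenize(text):
    # Naive word/punctuation split, standing in for a real tokenizer
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Hello, world!")           # ['Hello', ',', 'world', '!']
ids = [vocab[t] for t in tokens]             # [0, 1, 2, 3]
vectors = [embedding_table[i] for i in ids]  # one 4-dim vector per token
```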
Tokenization Strategies
Word-Level
Split on whitespace and punctuation:
"I love machine learning" → ["I", "love", "machine", "learning"]
Pros:
- Intuitive
- Each token is meaningful
Cons:
- Huge vocabulary (hundreds of thousands)
- Out-of-vocabulary (OOV) problem
- Can't handle typos or rare words
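A minimal word-level tokenizer makes the OOV problem concrete. The tiny vocabulary and the `[UNK]` fallback below are illustrative:

```python
import re

def word_tokenize(text):
    # Split into words, keeping punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

vocab = {"I": 0, "love": 1, "machine": 2, "learning": 3, "[UNK]": 4}

def encode(text):
    # Anything outside the vocabulary collapses to [UNK]: the OOV problem
    return [vocab.get(tok, vocab["[UNK]"]) for tok in word_tokenize(text)]

print(word_tokenize("I love machine learning"))  # ['I', 'love', 'machine', 'learning']
print(encode("I love transformers"))             # [0, 1, 4]: 'transformers' is lost
```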
Character-Level
Each character is a token:
"Hello" → ["H", "e", "l", "l", "o"]
Pros:
- Small vocabulary (~100)
- No OOV problem
- Handles any text
Cons:
- Very long sequences
- Harder to learn semantics
- Lost word boundaries
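Character-level tokenization is a one-liner, and the sequence-length cost is easy to see:

```python
def char_tokenize(text):
    # Every character becomes its own token
    return list(text)

print(char_tokenize("Hello"))  # ['H', 'e', 'l', 'l', 'o']

# The same sentence is 4 word-level tokens but 23 character-level tokens
sentence = "I love machine learning"
print(len(sentence.split()), len(char_tokenize(sentence)))  # 4 23
```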
Subword Tokenization (Modern Standard)
Split words into meaningful subunits:
"unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]
"playing" → ["play", "ing"]
Pros:
- Manageable vocabulary (30k-100k)
- No true OOV (rare words split into subwords)
- Captures morphology
- Best of both worlds
Subword Algorithms
Byte-Pair Encoding (BPE)
Used by GPT, RoBERTa, and many others.
Training:
- Start with a character-level vocabulary
- Count all adjacent symbol pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until the target vocabulary size is reached
Corpus: "low low low lower" → word counts {"low": 3, "lower": 1}
Step 1: Character vocabulary + counts
{'l': 4, 'o': 4, 'w': 4, 'e': 1, 'r': 1}
Step 2: Most frequent pair ('l', 'o'), 4 occurrences → merge into "lo"
Step 3: Most frequent pair ('lo', 'w'), 4 occurrences → merge into "low"
Vocabulary after two merges: {"l", "o", "w", "e", "r", "lo", "low"}
Further merges would add "lowe", then "lower".
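The merge loop above fits in a few lines. This is a simplified sketch of BPE training (no pre-tokenization; each word is a tuple of symbols), run on the same corpus:

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    # word_counts: e.g. {"low": 3, "lower": 1}; each word starts as chars
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, count in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Apply the merge inside every word
        new_words = {}
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] = count
        words = new_words
    return merges, words

merges, words = bpe_train({"low": 3, "lower": 1}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('low', 'e', 'r'): 1}
```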
WordPiece
Used by BERT.
Similar to BPE, but:
- Chooses merges by how much they increase the training-corpus likelihood, not by raw pair frequency
- Marks continuation tokens with a ## prefix
"playing" → ["play", "##ing"]
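At inference time, WordPiece segments each word greedily, longest match first. A sketch with a toy vocabulary (the vocabulary contents are invented):

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first segmentation; continuation pieces get ##
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes [UNK]
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "un", "##happy"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```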
Unigram (SentencePiece)
Used by T5, XLNet, LLaMA.
Probabilistic approach:
- Start with large vocabulary
- Iteratively remove tokens that least reduce likelihood
- Stop at target vocabulary size
Can sample multiple tokenizations of the same text (useful as subword regularization during training)!
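Given per-piece probabilities, the most likely segmentation can be found with a small Viterbi pass over character positions. The toy probabilities below are invented for illustration:

```python
import math

# Toy Unigram model: the best segmentation maximizes the product of
# piece probabilities. These probabilities are made up for the example.
probs = {"un": 0.1, "happy": 0.05, "happiness": 0.02, "ness": 0.08,
         "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01,
         "i": 0.01, "e": 0.01, "s": 0.01, "y": 0.01}

def viterbi_segment(text):
    # best[i] = (log-prob, segmentation) of the best way to split text[:i]
    best = [(0.0, [])] + [(-math.inf, []) for _ in text]
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(viterbi_segment("unhappiness"))  # ['un', 'happiness']
```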
Byte-Level BPE
Used by GPT-2, GPT-3, GPT-4.
Operates on bytes instead of characters:
- Works for any language or encoding
- Truly no OOV
- Only 256 base tokens
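The 256 base tokens are just the possible byte values; UTF-8 encoding turns any string, in any script, into them losslessly:

```python
text = "héllo 👋"
byte_ids = list(text.encode("utf-8"))  # accents and emoji become multi-byte runs

assert all(0 <= b < 256 for b in byte_ids)     # every id fits the 256-token base
assert bytes(byte_ids).decode("utf-8") == text  # and the encoding round-trips
print(len(text), len(byte_ids))  # 7 characters -> 11 byte tokens
```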
Special Tokens
| Token | Meaning | Used In |
|---|---|---|
| [PAD] | Padding | All |
| [UNK] | Unknown token | Word-level, WordPiece |
| [CLS] | Classification | BERT |
| [SEP] | Separator | BERT |
| [MASK] | Masked token | BERT |
| <s> | Start of sequence | Many |
| </s> | End of sequence | Many |
| <\|endoftext\|> | End of text | GPT models |
Vocabulary Size Tradeoffs
| Vocab Size | Sequence Length | Embedding Memory |
|---|---|---|
| Smaller | Longer | Less |
| Larger | Shorter | More |
Typical sizes:
- GPT-2: 50,257
- BERT: 30,522
- LLaMA: 32,000
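The embedding-memory column is simple arithmetic: the table holds `vocab_size × d_model` parameters. Using GPT-2's vocabulary with its hidden size of 768, in fp32:

```python
# Embedding table memory = vocab_size * d_model * bytes_per_param
# (GPT-2 small: vocab 50,257, d_model 768, fp32 = 4 bytes/param)
vocab_size, d_model, bytes_per_param = 50257, 768, 4
mem_bytes = vocab_size * d_model * bytes_per_param
print(f"{mem_bytes / 1e6:.0f} MB")  # ~154 MB just for token embeddings
```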
Pre-Tokenization
Before applying subword merges, the text is usually split into chunks at spaces and punctuation:
# GPT-2 pre-tokenizer regex
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
This prevents merged tokens from spanning word or punctuation boundaries.
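The real GPT-2 pattern needs the third-party `regex` package for the Unicode classes `\p{L}` and `\p{N}`; the ASCII-only approximation below uses only the standard library and shows the same chunking behavior:

```python
import re

# Simplified ASCII-only stand-in for the GPT-2 pre-tokenizer
# ([A-Za-z] and [0-9] replace the Unicode classes \p{L} and \p{N})
pattern = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

print(pattern.findall("I'll pay $20, okay?"))
# ['I', "'ll", ' pay', ' $', '20', ',', ' okay', '?']
```

Note how contractions, numbers, and punctuation each end up in their own chunk, and how leading spaces stay attached to the following word.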
Tokenization in Practice
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"
tokens = tokenizer.tokenize(text) # ['Hello', ',', 'Ġworld', '!']
ids = tokenizer.encode(text) # [15496, 11, 995, 0]
decoded = tokenizer.decode(ids) # "Hello, world!"
Common Gotchas
Whitespace Handling
Some tokenizers include leading space:
" hello" → ["Ġhello"] # GPT-2
"hello" → ["hello"]
Casing
Some tokenizers are case-sensitive, others lowercase:
"Hello" vs "hello" → Same or different tokens?
Tokenization Artifacts
Strange splits can happen:
"ChatGPT" → ["Chat", "G", "PT"] # Not ideal!
Multilingual Tokenization
Challenges:
- Different scripts and alphabets
- No whitespace between words (Chinese, Japanese)
- Different subword patterns
Solutions:
- Larger vocabulary
- Byte-level tokenization
- Language-specific tokenizers
Impact on Performance
Tokenization affects:
- Sequence length: Impacts memory and speed
- Rare words: Better handling with subwords
- Languages: Some tokenizers favor English
- Arithmetic: Numbers might tokenize oddly
Key Takeaways
- Tokenization converts text to model-processable units
- Subword tokenization (BPE, WordPiece) is the modern standard
- Vocabulary size trades off sequence length vs embedding size
- Byte-level BPE handles any text without OOV
- Tokenization quirks can affect model behavior
- Always use the tokenizer that matches your model!