Beginner · Natural Language Processing

Understand tokenization - how text is broken into tokens that neural networks can process, from words to subwords to characters.

Tags: preprocessing, bpe, text-processing, transformers

Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that can be processed by models. It's a crucial preprocessing step that significantly impacts model performance.

Why Tokenize?

Neural networks need numerical inputs. Tokenization bridges the gap in three steps:

  1. Convert text into discrete units (tokens)
  2. Map each unit to an integer index
  3. Look up an embedding vector for each index

"Hello, world!" → [15496, 11, 995, 0] → embeddings → model
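The three steps can be sketched end to end with a toy vocabulary (both the vocabulary and the 2-d embedding table here are invented purely for illustration):

```python
# Toy pipeline: text -> tokens -> ids -> embedding vectors.
# Vocabulary and embedding table are made up for illustration.
vocab = {"hello": 0, ",": 1, "world": 2, "!": 3}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # one 2-d vector per id

def tokenize(text):
    # naive word-level split: lowercase and separate punctuation
    for p in ",!":
        text = text.replace(p, f" {p} ")
    return text.lower().split()

tokens = tokenize("Hello, world!")       # ['hello', ',', 'world', '!']
ids = [vocab[t] for t in tokens]         # [0, 1, 2, 3]
vectors = [embeddings[i] for i in ids]   # what the model actually consumes
```

Real tokenizers replace the naive split and the tiny vocabulary, but the text → ids → vectors shape of the pipeline is the same.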

Tokenization Strategies

Word-Level

Split on whitespace and punctuation:

"I love machine learning" → ["I", "love", "machine", "learning"]

Pros:

  • Intuitive
  • Each token is meaningful

Cons:

  • Huge vocabulary (hundreds of thousands)
  • Out-of-vocabulary (OOV) problem
  • Can't handle typos, rare words
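A minimal word-level tokenizer makes the OOV problem concrete (the regex split and the [UNK] convention here are simplified stand-ins, not any particular library's behavior):

```python
import re

def word_tokenize(text):
    # split into runs of word characters or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

# tiny illustrative vocabulary with an explicit unknown token
vocab = {"i": 0, "love": 1, "machine": 2, "learning": 3, "[UNK]": 4}

tokens = word_tokenize("I love machiine learning")  # note the typo
ids = [vocab.get(t.lower(), vocab["[UNK]"]) for t in tokens]
# the typo 'machiine' collapses to [UNK] -- information is lost
```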

Character-Level

Each character is a token:

"Hello" → ["H", "e", "l", "l", "o"]

Pros:

  • Small vocabulary (~100)
  • No OOV problem
  • Handles any text

Cons:

  • Very long sequences
  • Harder to learn semantics
  • Lost word boundaries
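The sequence-length cost is easy to see directly:

```python
text = "machine learning"
char_tokens = list(text)    # ['m', 'a', 'c', 'h', ...]
word_tokens = text.split()  # ['machine', 'learning']
print(len(char_tokens), len(word_tokens))  # 16 vs 2
```

Eight times more tokens for the same text means eight times longer sequences for the model to attend over.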

Subword Tokenization (Modern Standard)

Split words into meaningful subunits:

"unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]
"playing" → ["play", "ing"]

Pros:

  • Manageable vocabulary (30k-100k)
  • No true OOV (rare words split into subwords)
  • Captures morphology
  • Best of both worlds

Subword Algorithms

Byte-Pair Encoding (BPE)

Used by GPT, RoBERTa, and many others.

Training:

  1. Start with character vocabulary
  2. Count all adjacent pairs
  3. Merge most frequent pair into new token
  4. Repeat until vocabulary size reached

Corpus: "low low low lower"

Step 0: Start from characters, with corpus counts:
{'l': 4, 'o': 4, 'w': 4, 'e': 1, 'r': 1}

Step 1: Most frequent adjacent pair is ('l', 'o'), appearing 4 times → merge into new token "lo"

Step 2: Most frequent pair is now ('lo', 'w'), appearing 4 times → merge into "low"

Vocabulary so far: {"l", "o", "w", "e", "r", "lo", "low"}; further merges of ('low', 'e') and ('lowe', 'r') would add "lowe" and "lower"
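The training loop above fits in a short function. This is a didactic sketch of the merge procedure, not a production implementation (real BPE also records the merge order for use at inference time, which we do here via the returned list):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    word_counts = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, count in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for word, count in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = count
        word_counts = merged
    return merges

merges = bpe_train(["low"] * 3 + ["lower"], num_merges=2)
# reproduces the worked example: first ('l', 'o'), then ('lo', 'w')
```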

WordPiece

Used by BERT.

Similar to BPE but:

  • Merges based on likelihood, not frequency
  • Adds a ## prefix to continuation (non-initial) tokens

"playing" → ["play", "##ing"]
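At inference time, WordPiece segments each word greedily, longest match first. A sketch of that matching loop (the three-token vocabulary is invented for the example):

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first segmentation, as in BERT-style WordPiece.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it's in the vocabulary
        if piece is None:
            return ["[UNK]"]  # no valid segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "[UNK]"}
wordpiece_tokenize("playing", vocab)  # ['play', '##ing']
```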

Unigram (SentencePiece)

Used by T5, XLNet, LLaMA.

Probabilistic approach:

  1. Start with large vocabulary
  2. Iteratively remove the tokens whose removal least reduces corpus likelihood
  3. Stop at target vocabulary size

Can sample multiple tokenizations!
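The core idea, choosing the segmentation with the highest total log-probability, can be sketched by brute force over a short word (the log-probabilities below are made up for illustration; real Unigram uses dynamic programming, not enumeration):

```python
def segmentations(word, vocab):
    # Enumerate every way to split `word` into in-vocabulary pieces.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                yield [piece] + rest

def best_segmentation(word, log_probs):
    # Pick the segmentation whose tokens have the highest summed log-prob.
    return max(segmentations(word, log_probs),
               key=lambda seg: sum(log_probs[t] for t in seg))

# Made-up unigram log-probabilities, purely for illustration.
log_probs = {"un": -2.0, "happiness": -4.0, "happ": -5.0, "iness": -5.0}
best_segmentation("unhappiness", log_probs)  # ['un', 'happiness']
```

Because every segmentation has a probability, training can also sample alternative tokenizations of the same word as a regularizer.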

Byte-Level BPE

Used by GPT-2, GPT-3, GPT-4.

Operate on bytes instead of characters:

  • Works for any language/encoding
  • Truly no OOV
  • 256 base tokens
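Why there is truly no OOV: every string reduces to a sequence of UTF-8 bytes, each of which is one of the 256 base tokens.

```python
text = "héllo 世界"
byte_ids = list(text.encode("utf-8"))  # 'é' and the CJK characters expand to multiple bytes

# every byte falls in 0..255, so 256 base tokens cover any string
assert all(0 <= b < 256 for b in byte_ids)

# decoding the bytes recovers the original text exactly
assert bytes(byte_ids).decode("utf-8") == text
```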

Special Tokens

Token          Meaning             Used In
[PAD]          Padding             All
[UNK]          Unknown             Word-level
[CLS]          Classification      BERT
[SEP]          Separator           BERT
[MASK]         Masked token        BERT
<s>            Start of sequence   Many
</s>           End of sequence     Many
<|endoftext|>  End of text         GPT-2

Vocabulary Size Tradeoffs

Vocab Size   Sequence Length   Embedding Memory
Smaller      Longer            Less
Larger       Shorter           More

Typical sizes:

  • GPT-2: 50,257
  • BERT: 30,522
  • LLaMA: 32,000
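The embedding-memory side of the tradeoff is simple arithmetic: vocab_size × hidden_dim × bytes per parameter. Plugging in GPT-2's published vocabulary size and the 768-dim embeddings of its smallest variant:

```python
vocab_size = 50257   # GPT-2 vocabulary
hidden_dim = 768     # GPT-2 (small) embedding dimension
bytes_per_param = 4  # float32

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"{embedding_bytes / 1e6:.0f} MB")  # ~154 MB for the embedding table alone
```

Doubling the vocabulary doubles this table, which is one reason vocabularies usually stop in the 30k-100k range.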

Pre-Tokenization

Before subword tokenization, often split on spaces/punctuation:

# GPT-2 pre-tokenizer regex
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

This ensures BPE merges never cross category boundaries (letters, digits, punctuation, whitespace).
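Note that the real pattern uses the Unicode classes \p{L} and \p{N}, which Python's stdlib re does not support (the third-party regex package does). A simplified ASCII-only approximation with stdlib re shows the behavior:

```python
import re

# Simplified, ASCII-only stand-in for GPT-2's pre-tokenizer:
# [A-Za-z] replaces \p{L} and [0-9] replaces \p{N}.
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""

re.findall(pattern, "Hello, world! It's 2024.")
# note how a leading space sticks to the following word,
# and the contraction 's is split off as its own piece
```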

Tokenization in Practice

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, world!"
tokens = tokenizer.tokenize(text)  # ['Hello', ',', 'Ġworld', '!']
ids = tokenizer.encode(text)       # [15496, 11, 995, 0]
decoded = tokenizer.decode(ids)    # "Hello, world!"

Common Gotchas

Whitespace Handling

Some tokenizers include leading space:

" hello" → ["Ġhello"]  # GPT-2
"hello" → ["hello"]

Casing

Some tokenizers are case-sensitive (e.g. GPT-2, bert-base-cased); others lowercase everything first (e.g. bert-base-uncased):

"Hello" vs "hello" → different tokens in a cased model, the same token in an uncased one

Tokenization Artifacts

Strange splits can happen:

"ChatGPT" → ["Chat", "G", "PT"]  # Not ideal!

Multilingual Tokenization

Challenges:

  • Different scripts and alphabets
  • No whitespace (Chinese, Japanese)
  • Different subword patterns

Solutions:

  • Larger vocabulary
  • Byte-level tokenization
  • Language-specific tokenizers

Impact on Performance

Tokenization affects:

  • Sequence length: Impacts memory and speed
  • Rare words: Better handling with subwords
  • Languages: Some tokenizers favor English
  • Arithmetic: Numbers might tokenize oddly

Key Takeaways

  1. Tokenization converts text to model-processable units
  2. Subword tokenization (BPE, WordPiece) is the modern standard
  3. Vocabulary size trades off sequence length vs embedding size
  4. Byte-level BPE handles any text without OOV
  5. Tokenization quirks can affect model behavior
  6. Always use the tokenizer that matches your model!