Beginner · Natural Language Processing

Understand tokenization - how text is broken into tokens that neural networks can process, from words to subwords to characters.

Tags: preprocessing, bpe, text-processing, transformers

Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that can be processed by models. It's a crucial preprocessing step that significantly impacts model performance.

Why Tokenize?

Neural networks need numerical inputs. Tokenization bridges the gap in three steps:

  1. Convert text into discrete units (tokens)
  2. Map each unit to an integer index
  3. Look up an embedding vector for each index

"Hello, world!" → [15496, 11, 995, 0] → embeddings → model
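The three steps can be sketched end to end with a toy vocabulary (both the vocabulary and the 2-d embedding table here are invented purely for illustration):

```python
# Toy pipeline: text -> tokens -> ids -> embedding vectors.
# Vocabulary and embedding table are made up for illustration.
vocab = {"hello": 0, ",": 1, "world": 2, "!": 3}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # one 2-d vector per id

def tokenize(text):
    # naive word-level split: lowercase and separate punctuation
    for p in ",!":
        text = text.replace(p, f" {p} ")
    return text.lower().split()

tokens = tokenize("Hello, world!")       # ['hello', ',', 'world', '!']
ids = [vocab[t] for t in tokens]         # [0, 1, 2, 3]
vectors = [embeddings[i] for i in ids]   # what the model actually consumes
```

Real tokenizers replace the naive split and the tiny vocabulary, but the text → ids → vectors shape of the pipeline is the same.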

Tokenization Strategies

Word-Level

Split on whitespace and punctuation:

"I love machine learning" → ["I", "love", "machine", "learning"]

Pros:

  • Intuitive
  • Each token is meaningful

Cons:

  • Huge vocabulary (hundreds of thousands)
  • Out-of-vocabulary (OOV) problem
  • Can't handle typos, rare words
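A minimal word-level tokenizer makes the OOV problem concrete (the regex split and the [UNK] convention here are simplified stand-ins, not any particular library's behavior):

```python
import re

def word_tokenize(text):
    # split into runs of word characters or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

# tiny illustrative vocabulary with an explicit unknown token
vocab = {"i": 0, "love": 1, "machine": 2, "learning": 3, "[UNK]": 4}

tokens = word_tokenize("I love machiine learning")  # note the typo
ids = [vocab.get(t.lower(), vocab["[UNK]"]) for t in tokens]
# the typo 'machiine' collapses to [UNK] -- information is lost
```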

Character-Level

Each character is a token:

"Hello" → ["H", "e", "l", "l", "o"]

Pros:

  • Small vocabulary (~100)
  • No OOV problem
  • Handles any text

Cons:

  • Very long sequences
  • Harder to learn semantics
  • Lost word boundaries
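The sequence-length cost is easy to see directly:

```python
text = "machine learning"
char_tokens = list(text)    # ['m', 'a', 'c', 'h', ...]
word_tokens = text.split()  # ['machine', 'learning']
print(len(char_tokens), len(word_tokens))  # 16 vs 2
```

Eight times more tokens for the same text means eight times longer sequences for the model to attend over.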

Subword Tokenization (Modern Standard)

Split words into meaningful subunits:

"unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]
"playing" → ["play", "ing"]

Pros:

  • Manageable vocabulary (30k-100k)
  • No true OOV (rare words split into subwords)
  • Captures morphology
  • Best of both worlds

Subword Algorithms

Byte-Pair Encoding (BPE)

Used by GPT, RoBERTa, and many others.

Training:

  1. Start with character vocabulary
  2. Count all adjacent pairs
  3. Merge most frequent pair into new token
  4. Repeat until vocabulary size reached

Corpus: "low low low lower"

Step 0: Start from characters, with corpus counts:
{'l': 4, 'o': 4, 'w': 4, 'e': 1, 'r': 1}

Step 1: Most frequent adjacent pair is ('l', 'o'), appearing 4 times → merge into new token "lo"

Step 2: Most frequent pair is now ('lo', 'w'), appearing 4 times → merge into "low"

Vocabulary so far: {"l", "o", "w", "e", "r", "lo", "low"}; further merges of ('low', 'e') and ('lowe', 'r') would add "lowe" and "lower"
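The training loop above fits in a short function. This is a didactic sketch of the merge procedure, not a production implementation (real BPE also records the merge order for use at inference time, which we do here via the returned list):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    word_counts = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, count in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for word, count in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = count
        word_counts = merged
    return merges

merges = bpe_train(["low"] * 3 + ["lower"], num_merges=2)
# reproduces the worked example: first ('l', 'o'), then ('lo', 'w')
```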

WordPiece

Used by BERT.

Similar to BPE but:

  • Merges based on likelihood, not frequency
  • Adds a ## prefix to continuation (non-initial) tokens

"playing" → ["play", "##ing"]
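At inference time, WordPiece segments each word greedily, longest match first. A sketch of that matching loop (the three-token vocabulary is invented for the example):

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first segmentation, as in BERT-style WordPiece.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it's in the vocabulary
        if piece is None:
            return ["[UNK]"]  # no valid segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "[UNK]"}
wordpiece_tokenize("playing", vocab)  # ['play', '##ing']
```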

Unigram (SentencePiece)

Used by T5, XLNet, LLaMA.

Probabilistic approach:

  1. Start with large vocabulary
  2. Iteratively remove the tokens whose removal least reduces corpus likelihood
  3. Stop at target vocabulary size

Can sample multiple tokenizations!
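The core idea, choosing the segmentation with the highest total log-probability, can be sketched by brute force over a short word (the log-probabilities below are made up for illustration; real Unigram uses dynamic programming, not enumeration):

```python
def segmentations(word, vocab):
    # Enumerate every way to split `word` into in-vocabulary pieces.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                yield [piece] + rest

def best_segmentation(word, log_probs):
    # Pick the segmentation whose tokens have the highest summed log-prob.
    return max(segmentations(word, log_probs),
               key=lambda seg: sum(log_probs[t] for t in seg))

# Made-up unigram log-probabilities, purely for illustration.
log_probs = {"un": -2.0, "happiness": -4.0, "happ": -5.0, "iness": -5.0}
best_segmentation("unhappiness", log_probs)  # ['un', 'happiness']
```

Because every segmentation has a probability, training can also sample alternative tokenizations of the same word as a regularizer.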

Byte-Level BPE

Used by GPT-2, GPT-3, GPT-4.

Operate on bytes instead of characters:

  • Works for any language/encoding
  • Truly no OOV
  • 256 base tokens
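Why there is truly no OOV: every string reduces to a sequence of UTF-8 bytes, each of which is one of the 256 base tokens.

```python
text = "héllo 世界"
byte_ids = list(text.encode("utf-8"))  # 'é' and the CJK characters expand to multiple bytes

# every byte falls in 0..255, so 256 base tokens cover any string
assert all(0 <= b < 256 for b in byte_ids)

# decoding the bytes recovers the original text exactly
assert bytes(byte_ids).decode("utf-8") == text
```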

Special Tokens

Token          Meaning             Used In
[PAD]          Padding             All
[UNK]          Unknown             Word-level
[CLS]          Classification      BERT
[SEP]          Separator           BERT
[MASK]         Masked token        BERT
<s>            Start of sequence   Many
</s>           End of sequence     Many
<|endoftext|>  End of text         GPT-2

Vocabulary Size Tradeoffs

Vocab Size   Sequence Length   Embedding Memory
Smaller      Longer            Less
Larger       Shorter           More

Typical sizes:

  • GPT-2: 50,257
  • BERT: 30,522
  • LLaMA: 32,000
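The embedding-memory side of the tradeoff is simple arithmetic: vocab_size × hidden_dim × bytes per parameter. Plugging in GPT-2's published vocabulary size and the 768-dim embeddings of its smallest variant:

```python
vocab_size = 50257   # GPT-2 vocabulary
hidden_dim = 768     # GPT-2 (small) embedding dimension
bytes_per_param = 4  # float32

embedding_bytes = vocab_size * hidden_dim * bytes_per_param
print(f"{embedding_bytes / 1e6:.0f} MB")  # ~154 MB for the embedding table alone
```

Doubling the vocabulary doubles this table, which is one reason vocabularies usually stop in the 30k-100k range.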

Pre-Tokenization

Before subword tokenization, often split on spaces/punctuation:

# GPT-2 pre-tokenizer regex
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

This ensures BPE merges never cross category boundaries (letters, digits, punctuation, whitespace).
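Note that the real pattern uses the Unicode classes \p{L} and \p{N}, which Python's stdlib re does not support (the third-party regex package does). A simplified ASCII-only approximation with stdlib re shows the behavior:

```python
import re

# Simplified, ASCII-only stand-in for GPT-2's pre-tokenizer:
# [A-Za-z] replaces \p{L} and [0-9] replaces \p{N}.
pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""

re.findall(pattern, "Hello, world! It's 2024.")
# note how a leading space sticks to the following word,
# and the contraction 's is split off as its own piece
```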

Tokenization in Practice

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, world!"
tokens = tokenizer.tokenize(text)  # ['Hello', ',', 'Ġworld', '!']
ids = tokenizer.encode(text)       # [15496, 11, 995, 0]
decoded = tokenizer.decode(ids)    # "Hello, world!"

Common Gotchas

Whitespace Handling

Some tokenizers include leading space:

" hello" → ["Ġhello"]  # GPT-2
"hello" → ["hello"]

Casing

Some tokenizers are case-sensitive (e.g. GPT-2, bert-base-cased); others lowercase everything first (e.g. bert-base-uncased):

"Hello" vs "hello" → different tokens in a cased model, the same token in an uncased one

Tokenization Artifacts

Strange splits can happen:

"ChatGPT" → ["Chat", "G", "PT"]  # Not ideal!

Multilingual Tokenization

Challenges:

  • Different scripts and alphabets
  • No whitespace (Chinese, Japanese)
  • Different subword patterns

Solutions:

  • Larger vocabulary
  • Byte-level tokenization
  • Language-specific tokenizers

Impact on Performance

Tokenization affects:

  • Sequence length: Impacts memory and speed
  • Rare words: Better handling with subwords
  • Languages: Some tokenizers favor English
  • Arithmetic: Numbers might tokenize oddly

Key Takeaways

  1. Tokenization converts text to model-processable units
  2. Subword tokenization (BPE, WordPiece) is the modern standard
  3. Vocabulary size trades off sequence length vs embedding size
  4. Byte-level BPE handles any text without OOV
  5. Tokenization quirks can affect model behavior
  6. Always use the tokenizer that matches your model!