Why This Matters
Every time you send text to an LLM, the first thing that happens is tokenization — the text gets chopped into small pieces called tokens. Think of it like how a sentence gets broken into words, except tokens aren't always neat words.
The Intuition
Imagine you're packing a suitcase (the LLM's context window). You can't just throw in whole paragraphs — you need to fold everything into standard-sized pieces first. Tokenization is that folding process. Common words like "the" get a single token, while rare words like "amprealize" might need 3-4 tokens.
How It Works
- Byte-Pair Encoding (BPE): The most common approach. Starts with individual characters, then iteratively merges the most frequent pairs. "running" might become ["run", "ning"].
- WordPiece: Used by BERT. Similar to BPE, but selects merges by training-corpus likelihood instead of raw pair frequency.
- SentencePiece: Language-agnostic. Works directly on raw text without pre-tokenization.
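The BPE merge loop above can be sketched in a few lines. This is a toy illustration over a tiny hand-made word-frequency dict (the function name and inputs are ours, not from any library); production tokenizers like GPT's operate on bytes over huge corpora.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE: learn merge rules from a {word: frequency} dict."""
    # Start with each word split into individual characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = train_bpe({"run": 10, "running": 5, "runner": 4}, 3)
print(merges)  # first merges glue the frequent "r"+"u", then "ru"+"n"
```

Because "run" appears inside every word, its character pairs dominate the counts and "run" quickly becomes a single token, which is exactly why common subwords end up cheap.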
Key Numbers to Know
| Model | Vocabulary Size | Context Window |
|---|---|---|
| GPT-4 | ~100,000 tokens | 128K tokens |
| Claude | ~100,000 tokens | 200K tokens |
| Llama 3.1 | ~128,000 tokens | 128K tokens |
Rule of thumb: 1 token ≈ 4 characters in English, or about ¾ of a word.
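That rule of thumb is handy for quick budgeting before you call a real tokenizer. A minimal estimator (the function name is ours; actual counts vary by model and language):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count via the ~4 characters/token heuristic.

    Only an approximation for English prose -- use the model's own
    tokenizer when the exact count matters. Floor of 1 for non-empty use.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox"))  # 19 chars -> about 5 tokens
```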
Why Token Count Matters
- Cost: LLM APIs charge per token (input + output)
- Context window: You can only fit so many tokens; anything beyond the limit must be truncated or dropped
- Speed: More tokens = longer generation time
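The cost bullet is simple arithmetic. A sketch with hypothetical per-million-token prices (the rates below are placeholders, not any provider's actual pricing):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """API cost in dollars, given separate input/output prices per 1M tokens."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# e.g. 50K input + 2K output tokens at hypothetical $3/M in, $15/M out:
print(estimate_cost(50_000, 2_000, 3.0, 15.0))  # 0.18 dollars
```

Note that output tokens are typically priced several times higher than input tokens, so long generations dominate the bill even when the prompt is large.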
See Also
- Embeddings — What happens after tokenization
- Prompt Engineering — Working within token budgets