⚡ Tokenization converts raw text into discrete tokens, represented as the integer IDs a language model actually processes. With the GPT-2 tokenizer, for example, “Hello world” becomes [15496, 995]. Every LLM pipeline starts with tokenization (text → integers) and ends with detokenization (integers → text). The tokenizer’s design directly shapes how efficiently different languages, code, and numbers are represented, and therefore how well the model handles them.
Category: NLP & Language · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read
Tokenization — What It Is, How Text Becomes Numbers & Why It Shapes Every LLM’s Capability
What is Tokenization?
A language model is a mathematical function. It takes numbers as input and produces numbers as output. Text is not numbers. Before a model can process “What is the capital of France?” it must be converted into a sequence of integers — each integer representing one token from the model’s fixed vocabulary.
Tokenization is this conversion. A tokenizer maps text to token IDs on the way in, and maps token IDs back to text on the way out. The model itself never sees characters — only the integer sequence. Generation happens by predicting the next integer in the sequence, one at a time, until a stop token is produced.
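The round trip described above can be sketched with a toy tokenizer. The seven-entry vocabulary and the `encode`/`decode` helpers below are made up for illustration; a real tokenizer has tens of thousands of entries and learned segmentation rules.

```python
# Toy illustration only: a hypothetical seven-entry vocabulary, not a real tokenizer.
VOCAB = {"What": 0, " is": 1, " the": 2, " capital": 3, " of": 4, " France": 5, "?": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Map text to token IDs by simple longest-match lookup in the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers the text at position {pos}")
    return ids


def decode(ids):
    """Map token IDs back to text by concatenating the token strings."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("What is the capital of France?")
print(ids)          # [0, 1, 2, 3, 4, 5, 6]
print(decode(ids))  # What is the capital of France?
```

The model would only ever see the list of integers; the strings exist on the tokenizer side of the boundary.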
The tokenizer is trained separately from the model, on a large text corpus using algorithms like BPE (Byte Pair Encoding) or WordPiece, and it fixes the token boundaries the model will use for its entire lifetime. Changing the tokenizer afterwards generally means retraining the model, because the learned embeddings are tied to specific token IDs.
How Byte Pair Encoding works
- Start: each character in the training corpus is its own token. Vocabulary = all unique characters.
- Count: find the most frequently occurring adjacent token pair.
- Merge: combine that pair into a new single token. Add it to the vocabulary.
- Repeat: run the Count and Merge steps again for a fixed number of merges (typically 30,000–100,000).
- Result: a vocabulary where common words are single tokens, uncommon words are split into common sub-pieces, and very rare characters remain as individual tokens.
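The training loop above can be sketched in a few lines. This is a minimal version under simplifying assumptions: the corpus is a list of already-split words, there is no end-of-word marker, and ties between equally frequent pairs go to the pair seen first. The function names `merge_pair` and `train_bpe` are illustrative, not from any library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules; returns them in the order learned."""
    # Start: every word is a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count: frequency of each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge: fuse the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(words, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

After four merges, "low" and "est" already exist as single symbols, which is the pattern the Result step describes: frequent pieces fuse early, rare ones stay split.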
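After training, the tokenizer applies the learned merge rules to new text in the order they were learned: the earliest matching merge is applied first, and this repeats until no rule applies. Greedy longest-match lookup against the vocabulary is the approach WordPiece uses, not BPE.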
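One common way to apply learned merges to a new word is to replay them by rank, always fusing the adjacent pair that was learned earliest. The sketch below assumes a small hand-written merge table (`MERGES` is hypothetical) and handles one word at a time; production tokenizers add pre-tokenization and caching on top of this.

```python
def bpe_encode(word, merges):
    """Split one word into subword tokens by replaying merges in learned order."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Choose the adjacent pair with the lowest rank (learned earliest).
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, listed in the order the merges were learned.
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(bpe_encode("lowest", MERGES))  # ['low', 'est']
print(bpe_encode("slow", MERGES))    # ['s', 'low']
```

Neither "lowest" nor "slow" needs to have appeared in training; both are assembled from pieces that did.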
Tokenization challenges
Numbers: each digit is often a separate token, so “1234567” may be 7 tokens. This fragments numerical information and partially explains why LLMs struggle with arithmetic: adjacent digits land in separate tokens, and nothing in the token IDs themselves encodes place value.
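How a number fragments depends on the digit-splitting rule the tokenizer uses. The sketch below compares two rules, both chosen here as illustrations rather than taken from a specific model: one token per digit, and left-to-right chunks of up to three digits.

```python
import re

number = "1234567"

# Rule A (assumed for illustration): every digit becomes its own token.
per_digit = list(number)

# Rule B (assumed for illustration): digits are grouped left to right, up to three at a time.
chunked = re.findall(r"\d{1,3}", number)

print(per_digit)  # ['1', '2', '3', '4', '5', '6', '7']
print(chunked)    # ['123', '456', '7']
```

Note that left-to-right chunking gives 123 | 456 | 7, which does not line up with the place-value grouping 1 | 234 | 567, so even the shorter sequence hides positional structure from the model.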
Code: programming syntax (brackets, semicolons, indentation spaces) is often tokenised piece by piece. As a result, code tends to be tokenised less efficiently than natural language, using more tokens per line than prose of equivalent length.
Non-Latin scripts: scripts like Chinese, Arabic, and Devanagari are underrepresented in the corpora tokenizers are trained on, so their characters often do not merge into efficient multi-character tokens. A single Chinese character may cost 1-3 tokens, as much as or more than a whole English word, even though one Chinese character usually carries far more meaning than one Latin letter. The same content therefore tends to consume more tokens in these scripts.
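The "1-3 tokens per character" range has a simple origin if we assume a byte-level tokenizer: text starts out as UTF-8 bytes, and a character whose bytes were never merged during training costs one token per byte. The helper `utf8_cost` and the sample strings below are illustrative only.

```python
def utf8_cost(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "English": "hello",
    "Chinese": "你好",
    "Arabic": "مرحبا",
    "Hindi (Devanagari)": "नमस्ते",
}

for name, text in samples.items():
    n_chars, n_bytes = utf8_cost(text)
    print(f"{name}: {n_chars} code points, {n_bytes} UTF-8 bytes")

# "hello" is 5 code points in 5 bytes; "你好" is 2 code points in 6 bytes.
```

The byte count is the worst case before any merges apply. How far a real tokenizer gets below it depends on how much text in that script its training corpus contained.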
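Whitespace and special characters: spaces are often folded into the following word’s token. SentencePiece-style vocabularies, for instance, write this as “▁hello”: the token for “hello” preceded by a space. Keep this in mind when working with raw token IDs, because joining or comparing decoded pieces naively can drop or duplicate spaces.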
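The space-marker convention can be sketched as a pair of helper functions. This is a simplified imitation of the SentencePiece-style behaviour, not its actual implementation; `mark_spaces`, `unmark`, and the two-piece segmentation are assumptions made for the example.

```python
MARKER = "\u2581"  # the "▁" character used as a visible stand-in for a space


def mark_spaces(text):
    """Prefix the text with a marker and turn every space into a marker."""
    return MARKER + text.replace(" ", MARKER)


def unmark(pieces):
    """Join token strings, restore spaces, and drop the single leading space."""
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text


print(mark_spaces("hello world"))  # ▁hello▁world

# Hypothetical segmentation of the marked text into two tokens.
pieces = [MARKER + "hello", MARKER + "world"]
print(unmark(pieces))  # hello world
```

Because the marker travels with the token, "hello" at the start of a word and "hello" in the middle of one are different vocabulary entries, which matters when you look tokens up by string.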
Real-world examples
Not theory — what real teams actually shipped using this technique.
- OpenAI’s tiktoken: the tokenizer used for GPT-3.5, GPT-4, and the embeddings API. It is freely available as a Python library, and the tokenizer playground at platform.openai.com/tokenizer lets you paste text and see exactly how it is tokenised.
- Llama 3’s expanded vocabulary: Meta increased the vocabulary from 32,000 tokens in Llama 2 to 128,000 in Llama 3, specifically to improve multilingual tokenization efficiency and reduce the token count for non-English languages.
- The “strawberry” problem: GPT models famously miscounted the letter ‘r’ in “strawberry.” Part of the reason is that the word reaches the model as a handful of subword tokens (for example “straw” + “berry”; the exact split depends on the tokenizer) rather than as individual letters, which makes character-level operations genuinely harder than word-level ones.
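The gap between the character view and the token view in the “strawberry” example can be shown directly. The two-token split is the one quoted above, and the IDs 101 and 102 are invented for the illustration; real splits and IDs vary by tokenizer.

```python
word = "strawberry"

# Character-level view: ordinary code sees every letter.
print(word.count("r"))  # 3

# Token-level view: the split from the text, with made-up IDs.
tokens = ["straw", "berry"]
toy_ids = {"straw": 101, "berry": 102}
print([toy_ids[t] for t in tokens])  # [101, 102]

# The letter count is only recoverable after mapping tokens back to characters.
print(sum(t.count("r") for t in tokens))  # 3
```

Nothing in the pair [101, 102] says how many times any letter occurs; the model has to have learned the spelling of each token to answer.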
Common pitfalls
- Token boundary assumptions: never assume tokenisation boundaries align with word boundaries. “New York” may be two tokens or one. “don’t” may be “don” + “’t”. Character-level operations (counting letters, reversing strings) require special handling.
- Tokenizer mismatch: using a different tokenizer than the one the model was trained with produces wrong token IDs and garbled outputs. Always use the official tokenizer for each model.
- Prompt token counting for context limits: calculating how many tokens a prompt uses requires running the actual tokenizer. Word count estimates are inaccurate. Use tiktoken or the token-counting endpoint of the model’s API.
- Vocabulary limitation: text that the vocabulary does not cover directly (a rare character, a new emoji) is split into byte-level fallback tokens, so unusual Unicode characters can produce unexpectedly long token sequences.
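The byte-level fallback in the last pitfall can be sketched as follows. The three-entry vocabulary and the function `encode_with_byte_fallback` are hypothetical, and the `<0xNN>` spelling for byte tokens is one convention used by some vocabularies, adopted here for readability.

```python
# Hypothetical tiny vocabulary: only these characters have their own token.
VOCAB = {"h": 0, "i": 1, " ": 2}


def encode_with_byte_fallback(text):
    """Return token strings; characters missing from VOCAB become one token per UTF-8 byte."""
    tokens = []
    for ch in text:
        if ch in VOCAB:
            tokens.append(ch)
        else:
            # Fallback: one <0xNN> token for each byte of the character's UTF-8 encoding.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# "\U0001F642" is the slightly smiling face emoji, which is not in VOCAB.
print(encode_with_byte_fallback("hi \U0001F642"))
# ['h', 'i', ' ', '<0xF0>', '<0x9F>', '<0x99>', '<0x82>']
```

One visible character became four tokens, which is why a prompt full of unusual symbols can consume far more of the context window than its length suggests.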
Frequently asked questions
QUESTION 1 What is tokenization in simple terms?
ANSWER 1 Converting text into integers a model can process — and converting integers back to text after generation. The first and last step of every LLM pipeline.
QUESTION 2 What is Byte Pair Encoding?
ANSWER 2 An algorithm that iteratively merges the most frequent pair of adjacent symbols (characters at first, then previously merged pieces) into a new token, so common words end up as single tokens and rare words as recognisable sub-pieces.
QUESTION 3 Why does tokenization affect multilingual performance?
ANSWER 3 Tokenizers trained on English-heavy data need fewer tokens per word in English than in other languages, so non-English content uses more of the context window and costs more per API call.
QUESTION 4 What are common vocabulary sizes?
ANSWER 4 GPT-4: ~100,000. Llama 3: 128,000. GPT-3: 50,257. Larger vocabularies mean more single-token words, shorter sequences, and better multilingual coverage.
Sources & further reading
- Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 — BPE tokenization paper.
- Kudo & Richardson (2018). SentencePiece. arXiv:1808.06226 — language-agnostic tokenization.
- OpenAI Tokenizer: platform.openai.com/tokenizer — interactive tokenization explorer.
- tiktoken GitHub: github.com/openai/tiktoken — fast Python BPE tokenizer.
- Hugging Face tokenizers: huggingface.co/docs/tokenizers — comprehensive tokenizer library documentation.