Tokenization converts raw text into discrete tokens, the integers a language model actually processes. For example, GPT-2’s tokenizer encodes “Hello world” as [15496, 995]. Every LLM pipeline starts with tokenization (text → integers) and ends with detokenization (integers → text). The tokenizer’s design directly shapes how efficiently different languages, code, and numbers are represented, and therefore how well the model handles them.

Category: NLP & Language · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read


Tokenization — What It Is, How Text Becomes Numbers & Why It Shapes Every LLM’s Capability

What is Tokenization?

A language model is a mathematical function. It takes numbers as input and produces numbers as output. Text is not numbers. Before a model can process “What is the capital of France?” it must be converted into a sequence of integers — each integer representing one token from the model’s fixed vocabulary.

Tokenization is this conversion. A tokenizer maps text to token IDs on the way in, and maps token IDs back to text on the way out. The model itself never sees characters — only the integer sequence. Generation happens by predicting the next integer in the sequence, one at a time, until a stop token is produced.
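The round trip can be sketched with a toy tokenizer. This is an illustration only: the vocabulary, the IDs, and the helper names `encode` and `decode` are invented here, and the simple longest-prefix lookup stands in for a real subword algorithm.

```python
# Toy vocabulary; the IDs are invented for illustration and do not
# correspond to any real model's vocabulary.
TOY_VOCAB = {"What": 0, " is": 1, " the": 2, " capital": 3,
             " of": 4, " France": 5, "?": 6}
ID_TO_TOKEN = {i: tok for tok, i in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Map text to token IDs by longest-prefix lookup in the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map token IDs back to text by concatenating their strings."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("What is the capital of France?")
print(ids)          # [0, 1, 2, 3, 4, 5, 6]
print(decode(ids))  # What is the capital of France?
```

The model only ever receives the list of integers; the strings exist solely inside the tokenizer's lookup tables.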

The tokenizer is trained separately from the model, on a large text corpus using algorithms like BPE (Byte Pair Encoding) or WordPiece, and it determines which token boundaries the model will use for its entire lifetime. Because the model’s learned weights are tied to specific token IDs, changing the tokenizer generally means retraining the model, in most cases from scratch.

How Byte Pair Encoding works

  1. Start: each character in the training corpus is its own token. Vocabulary = all unique characters.
  2. Count: find the most frequently occurring adjacent token pair.
  3. Merge: combine that pair into a new single token. Add it to the vocabulary.
  4. Repeat steps 2-3 for a fixed number of merges (typically 30,000–100,000).
  5. Result: a vocabulary where common words are single tokens, uncommon words are split into common sub-pieces, and very rare characters remain as individual tokens.
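The steps above can be sketched in a few lines of Python. This is a minimal teaching version under stated assumptions: the corpus is a short list of words, merges never cross word boundaries, ties are broken by whichever pair was seen first, and the function name `train_bpe` is made up for this example. Production tokenizers add pre-tokenization, byte-level handling, and far faster data structures.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words."""
    # Step 1: every word starts as a sequence of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Step 3: replace each occurrence of the best pair with one new token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
print(train_bpe(toy_words, num_merges=3))
```

The returned list is ordered: earlier merges were more frequent in the corpus, and that order matters again when the tokenizer is applied to new text.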

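After training, the tokenizer applies the learned merge rules to new text in the order they were learned: the text is split into base symbols, then the earliest-learned merge that matches is applied, repeatedly, until no rule applies. (Greedy longest-match lookup at each position is how WordPiece tokenizers typically segment text, not standard BPE.)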
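One common way to apply learned merges to new text is by rank: repeatedly merge whichever adjacent pair was learned earliest. The sketch below assumes a small hand-written merge list (`MERGES` is hypothetical, not taken from any real tokenizer) and a made-up helper name, `bpe_encode`.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then apply merges in learned order."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the lowest rank (earliest-learned merge).
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        # Merge every occurrence of that pair, left to right.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, as if learned from a tiny corpus.
MERGES = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(bpe_encode("lowest", MERGES))  # ['low', 'est']
print(bpe_encode("slow", MERGES))    # ['s', 'low']
```

Note that "slow" was never seen as a whole word, yet it still decomposes into known pieces. That is the property that lets a fixed vocabulary cover open-ended text.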

Tokenization challenges

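Numbers: digits are often split into separate tokens, one at a time or in short groups, depending on the tokenizer. “1234567” may be as many as 7 tokens. This fragments numerical information and partially explains why LLMs struggle with arithmetic: token boundaries carry no information about place value, so the model has to learn the positional relationships between digit tokens from data.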
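A rough sketch of two digit-splitting schemes shows why place value gets lost. The assumption here is that a tokenizer either emits one token per digit or pre-splits digit runs left to right into groups of up to three, as some tokenizers do; the helper names are hypothetical and no real tokenizer is being called.

```python
import re


def split_single_digits(number: str) -> list[str]:
    """One piece per digit."""
    return list(number)


def split_digit_groups(number: str, max_len: int = 3) -> list[str]:
    """Left-to-right groups of up to max_len digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number)


print(split_single_digits("1234567"))  # ['1', '2', '3', '4', '5', '6', '7']
print(split_digit_groups("1234567"))   # ['123', '456', '7']
print(split_digit_groups("234567"))    # ['234', '567']
```

The same trailing digits "567" end up in different pieces depending on how long the number is, so the pieces do not line up with thousands, hundreds, and units.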

Code: programming syntax (brackets, semicolons, runs of indentation spaces) is often split into many small tokens. As a result, code is frequently tokenised less efficiently than natural language, with more tokens per line than prose of equivalent length.

Non-Latin scripts: scripts like Chinese, Arabic, and Devanagari are often underrepresented in tokenizer training corpora, so their characters merge into multi-character tokens less often. A single Chinese character may take 1-3 tokens, whereas a common English word is often a single token, so the same content tends to cost more tokens in these scripts than in English.
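For byte-level tokenizers, part of the gap is visible before any merges are learned: the starting sequence is the UTF-8 byte string, and non-Latin characters take more bytes each. The sketch below only measures bytes per character with the standard library; the sample strings and the helper name are chosen for illustration, and real token counts depend on which merges a given tokenizer learned.

```python
SAMPLES = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Chinese": "你好",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


for language, text in SAMPLES.items():
    print(language, utf8_bytes_per_char(text))
# English 1.0
# Arabic 2.0
# Chinese 3.0
```

A script that starts at three bytes per character needs more learned merges to reach the same tokens-per-word ratio, and it gets fewer of them when it is rare in the tokenizer's training data.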

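Whitespace and special characters: spaces are often folded into the following word’s token rather than tokenised on their own. SentencePiece-style tokenizers write this as “▁hello” (the token for “hello” preceded by a space), while GPT-2-style byte-level BPE vocabularies display the same idea as “Ġhello”. Keep this in mind when working with raw token IDs or joining token strings by hand.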
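A minimal sketch of the space-marker convention, assuming the SentencePiece-style "▁" marker with a marker added at the start of the text. The helper names `mark_spaces` and `detokenize` are hypothetical and the two-piece segmentation is made up for the example.

```python
SPACE_MARKER = "\u2581"  # "▁", the visible stand-in for a space


def mark_spaces(text: str) -> str:
    """Prefix the text with a marker and replace every space with it."""
    return SPACE_MARKER + text.replace(" ", SPACE_MARKER)


def detokenize(pieces: list[str]) -> str:
    """Join pieces, turn markers back into spaces, drop the leading space."""
    text = "".join(pieces).replace(SPACE_MARKER, " ")
    return text[1:] if text.startswith(" ") else text


pieces = ["▁hello", "▁world"]  # hypothetical segmentation
print(detokenize(pieces))                              # hello world
print(mark_spaces("hello world") == "".join(pieces))   # True
```

Joining token strings with " ".join(...) instead would produce stray markers and doubled spaces, which is why detokenization should go through the tokenizer's own decode step.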

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • OpenAI’s tiktoken — the tokenizer used for GPT-3.5, GPT-4, and the embeddings API. Freely available as a Python library. The tokenizer playground at platform.openai.com/tokenizer lets you paste text and see exactly how it is tokenised.
  • Llama 3’s expanded vocabulary — Meta increased Llama 3’s vocabulary from Llama 2’s 32,000 to 128,000 tokens, specifically to improve multilingual tokenization efficiency and reduce the token count for non-English languages.
  • The “strawberry” problem — GPT models famously miscounted the letter ‘r’ in “strawberry.” Part of the reason: the word is split into a few sub-word tokens (for example “straw” + “berry”; the exact split depends on the tokenizer), so the model sees token IDs rather than individual letters, which makes character-level operations harder than word-level ones.
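To make the “strawberry” point concrete, here is a small sketch of what the model receives versus what a letter count needs. The two-piece split follows the example above, and the IDs are invented; a real tokenizer may split the word differently.

```python
# Hypothetical pieces and invented IDs, for illustration only.
TOY_VOCAB = {"straw": 101, "berry": 102}
ID_TO_PIECE = {i: piece for piece, i in TOY_VOCAB.items()}

# What the model receives: two opaque integers with no letters in them.
token_ids = [TOY_VOCAB["straw"], TOY_VOCAB["berry"]]
print(token_ids)  # [101, 102]

# Counting letters requires mapping the IDs back to text first.
text = "".join(ID_TO_PIECE[i] for i in token_ids)
print(text.count("r"))  # 3
```

Ordinary code can do the lookup trivially; the model has to have learned the spelling of each token from its training data.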

Common pitfalls

  • Token boundary assumptions — never assume tokenisation boundaries align with word boundaries. “New York” may be two tokens or one. “don’t” may be “don” + “’t”. Character-level operations (counting letters, reversing strings) require special handling.
  • Tokenizer mismatch — using a different tokenizer than the one the model was trained with produces wrong token IDs and garbled outputs. Always use the official tokenizer for each model.
  • Prompt token counting for context limits — calculating how many tokens a prompt uses requires running the actual tokenizer. Word-count estimates are inaccurate. Use tiktoken or the token-counting endpoint of the model’s API, where one is provided.
  • Vocabulary limitation — text with no dedicated token in the vocabulary (a rare character, a new emoji) is split into smaller pieces; byte-level tokenizers fall back to one token per UTF-8 byte. Unusual Unicode characters can therefore produce unexpectedly long token sequences.
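The byte-level fallback in the last pitfall can be sketched as follows. The assumptions: a toy character-level vocabulary of lowercase ASCII letters and the space, a made-up function name, and byte tokens written in the `<0xNN>` style that SentencePiece uses for its byte-fallback pieces. Real tokenizers apply the fallback to subword pieces rather than single characters.

```python
import string

# Hypothetical vocabulary: lowercase ASCII letters and the space.
TOY_VOCAB = set(string.ascii_lowercase) | {" "}


def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Emit known characters as-is; split unknown ones into UTF-8 byte tokens."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # One token per UTF-8 byte of the unknown character.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


print(encode_with_byte_fallback("caf\u00e9", TOY_VOCAB))
# ['c', 'a', 'f', '<0xC3>', '<0xA9>']
print(len(encode_with_byte_fallback("hi \U0001F99C", TOY_VOCAB)))  # 7
```

In the second call a single emoji accounts for four of the seven tokens, which is how a short message full of unusual characters can consume far more of the context window than its length suggests.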

Frequently asked questions

QUESTION 1 What is tokenization in simple terms?

ANSWER 1 Converting text into integers a model can process — and converting integers back to text after generation. The first and last step of every LLM pipeline.

QUESTION 2 What is Byte Pair Encoding?

ANSWER 2 An algorithm that iteratively merges the most frequent adjacent token pairs (starting from single characters or bytes) into new tokens, so that common words become single tokens and rare words are split into recognisable sub-pieces.

QUESTION 3 Why does tokenization affect multilingual performance?

ANSWER 3 Tokenizers trained on English-heavy data assign fewer tokens per word in English than in other languages — non-English content uses more context window and costs more per API call.

QUESTION 4 What are common vocabulary sizes?

ANSWER 4 GPT-4: ~100,000. Llama 3: 128,000. GPT-3: 50,257. Larger vocabularies mean more single-token words, shorter sequences, and better multilingual coverage.


Sources & further reading

  • Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 — BPE tokenization paper.
  • Kudo & Richardson (2018). SentencePiece. arXiv:1808.06226 — language-agnostic tokenization.
  • OpenAI Tokenizer: platform.openai.com/tokenizer — interactive tokenization explorer.
  • tiktoken GitHub: github.com/openai/tiktoken — fast Python BPE tokenizer.
  • Hugging Face tokenizers: huggingface.co/docs/tokenizers — comprehensive tokenizer library documentation.
