What is Byte Pair Encoding (BPE)?

BPE is the most widely used tokenization algorithm in current LLMs. Start with individual characters as tokens. Find the most frequent adjacent pair in the corpus and merge it into a new token. Repeat thousands of times. After enough merges, common words become single tokens ('hello' → one ID), less common words split into pieces ('tokenization' → 'token' + 'ization'), and very rare characters stay as single-character tokens. The result is a compact vocabulary that can still represent any text.

What is the vocabulary size of common LLMs?

GPT-2 and GPT-3: 50,257 tokens. GPT-4: approximately 100,000 tokens. Llama 2: 32,000 tokens. Llama 3: 128,000 tokens — expanded significantly to improve multilingual and code coverage. Larger vocabularies mean more single-token representations for more words, reducing sequence length and improving efficiency, but require a larger embedding table.
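As a rough illustration of the embedding-table cost, the table holds one vector per vocabulary entry, so its parameter count is vocabulary size times hidden dimension. The hidden size of 4,096 below is an assumption chosen for illustration, not a figure from this article, and `embedding_params` is a hypothetical helper name.

```python
# Rough embedding-table size: one vector of length d_model per vocabulary entry.
# D_MODEL = 4096 is an illustrative assumption, not a value taken from the article.

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model

D_MODEL = 4096  # assumed hidden size

for label, vocab_size in [("32,000-token vocabulary", 32_000),
                          ("128,000-token vocabulary", 128_000)]:
    params = embedding_params(vocab_size, D_MODEL)
    print(f"{label}: {params:,} embedding parameters")
```

Under that assumption, quadrupling the vocabulary quadruples the embedding table, which is the trade-off against shorter sequences.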

Tokenization – UseCaseinAI

Q: What is tokenization in simple terms?

Tokenization is cutting text into pieces a model can process and converting those pieces into numbers. With the GPT-2 tokenizer, 'Hello world' becomes [15496, 995] (two integer IDs). The model never sees letters or words, only these integers. Tokenization is the first step in every NLP pipeline, and detokenization is the last step in generation: the model produces integer IDs that the tokenizer converts back into readable text.

Q: Why does tokenization affect multilingual performance?

Tokenizers trained predominantly on English text assign common English words to single tokens but require many tokens for equivalent content in other languages. A sentence in English might be 10 tokens; the same sentence translated to Tamil might be 30-50 tokens. This means non-English languages consume more of the context window, cost more per API call, and may be represented less richly in the model's vocabulary — contributing to worse performance on low-resource languages.

⚡ Tokenization converts raw text into discrete tokens — the integers a language model actually processes. “Hello world” becomes [15496, 995]. Every LLM pipeline starts with tokenization (text → integers) and ends with detokenization (integers → text). The tokenizer’s design directly shapes how efficiently different languages, code, and numbers are represented — and therefore how well the model handles them.

Category: NLP & Language · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read

Tokenization — What It Is, How Text Becomes Numbers & Why It Shapes Every LLM’s Capability

What is Tokenization?

A language model is a mathematical function. It takes numbers as input and produces numbers as output. Text is not numbers. Before a model can process “What is the capital of France?” it must be converted into a sequence of integers — each integer representing one token from the model’s fixed vocabulary.

Tokenization is this conversion. A tokenizer maps text to token IDs on the way in, and maps token IDs back to text on the way out. The model itself never sees characters — only the integer sequence. Generation happens by predicting the next integer in the sequence, one at a time, until a stop token is produced.
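To make the two directions concrete, here is a minimal sketch of a toy tokenizer. The five-entry vocabulary, the word-level splitting, and the names `encode` and `decode` are all assumptions for illustration; real tokenizers use subword units and far larger vocabularies.

```python
# Toy word-level tokenizer with a hypothetical five-entry vocabulary.
# Real tokenizers work on subword units, not whitespace-separated words.

VOCAB = {"<unk>": 0, "what": 1, "is": 2, "the": 3, "capital": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Text -> token IDs. Unknown words map to the <unk> ID."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def decode(ids: list[int]) -> str:
    """Token IDs -> text."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("What is the capital")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # what is the capital
```

The model only ever sees the list of integers. Everything about spelling and casing that the tokenizer discards (here, capitalisation) is invisible to it.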

The tokenizer is trained separately from the model, on a large text corpus using algorithms like BPE (Byte Pair Encoding) or WordPiece, and determines which token boundaries the model will use for its entire lifetime. Because the model’s learned embeddings are tied to specific token IDs, changing the tokenizer generally requires retraining the model from scratch.

How Byte Pair Encoding works

1. Start: each character in the training corpus is its own token. Vocabulary = all unique characters.
2. Count: find the most frequently occurring adjacent token pair.
3. Merge: combine that pair into a new single token. Add it to the vocabulary.
4. Repeat steps 2-3 for a fixed number of merges (typically 30,000–100,000).
5. Result: a vocabulary where common words are single tokens, uncommon words are split into common sub-pieces, and very rare characters remain as individual tokens.
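The steps above can be sketched in a few lines of Python. This is a simplified illustration, not a production trainer: the function name `train_bpe` and the toy corpus are hypothetical, merges never cross word boundaries, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word becomes a tuple of single-character tokens with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count step: tally adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Merge step: rewrite the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 3
merges = train_bpe(toy_corpus, num_merges=3)
print(merges)
```

On this toy corpus the frequent word "low" becomes a single token after two merges, while the rarer "newest" is still mostly individual characters.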

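After training, the tokenizer encodes new text by splitting it into base symbols and replaying the learned merge rules in the order they were learned, so the earliest (most frequent) merges take priority. This differs from WordPiece-style tokenizers, which greedily pick the longest matching vocabulary entry at each position.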
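As a minimal sketch of applying learned merge rules to new text, the function below replays a merge list in the order the rules were learned. The name `bpe_encode` and the three-rule merge list are hypothetical. Production tokenizers add pre-tokenization and a final lookup from token strings to integer IDs.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then apply merge rules in learned order."""
    tokens = list(word)
    for left, right in merges:  # earlier rules take priority
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge list, as if learned from a small corpus.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", MERGES))   # ['low', 'er']
print(bpe_encode("slowly", MERGES))  # ['s', 'low', 'l', 'y']
```

A word never seen in training ("slowly") still encodes: the pieces covered by merge rules collapse, and the rest stays as single characters.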

Tokenization challenges

Numbers: each digit is often a separate token. “1234567” may be 7 tokens. This fragments numerical information and partially explains why LLMs struggle with arithmetic — adjacent digits are often in separate tokens with no implied positional relationship.
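To illustrate why digit chunks carry no place-value alignment, the sketch below splits digit strings left to right into fixed-size chunks. This is an assumed, simplified rule: some tokenizers split digit runs into chunks of up to three digits and others into single digits, and the helper `chunk_digits` is hypothetical.

```python
import re

def chunk_digits(number: str, max_len: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most max_len digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number)

print(chunk_digits("1234567"))     # ['123', '456', '7']
print(chunk_digits("234567"))      # ['234', '567']
print(chunk_digits("1234567", 1))  # ['1', '2', '3', '4', '5', '6', '7']
```

Dropping the leading digit changes every chunk, so the same digits in the same place values end up in different tokens. Nothing in the tokens themselves tells the model which chunk is the thousands and which is the units.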

Code: programming syntax (brackets, semicolons, indentation spaces) is tokenised individually. Code is often less efficiently tokenised than natural language — more tokens per line than prose of equivalent length.

Non-Latin scripts: scripts like Chinese, Arabic, and Devanagari are underrepresented in many tokenizer training corpora, so their characters often do not merge into efficient multi-character tokens. A single Chinese character may take 1-3 tokens, while an entire common English word often takes just one. Text in these scripts therefore tends to cost more tokens for the same meaning, even though each Chinese character carries more information than a Latin letter.
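One way to see where the extra tokens come from, assuming a byte-level tokenizer (an assumption, since not every tokenizer works on bytes): the starting units are UTF-8 bytes, so a character that never merges with its neighbours costs one token per byte. The helper `utf8_byte_count` is hypothetical.

```python
# UTF-8 byte length of single characters from different scripts.
# Under a byte-level tokenizer, an unmerged character costs one token per byte.

def utf8_byte_count(text: str) -> int:
    return len(text.encode("utf-8"))

samples = [
    ("Latin", "a"),
    ("Arabic", "\u0628"),      # Arabic letter beh
    ("Devanagari", "\u0915"),  # Devanagari letter ka
    ("Chinese", "\u4e2d"),     # Chinese character zhong
]

for script, ch in samples:
    print(f"{script}: {ch} -> {utf8_byte_count(ch)} byte(s)")
```

A Latin letter starts at one byte, while the Arabic letter starts at two and the Devanagari and Chinese characters at three, so these scripts need more merges before they reach the same tokens-per-word ratio as English.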

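Whitespace and special characters: spaces are often folded into the following word’s token (“▁hello” is the token for “hello” preceded by a space, in the SentencePiece convention). As a result, a word at the start of a text and the same word after a space can map to different token IDs, which matters when working with raw token IDs.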
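A minimal sketch of how text is rebuilt from tokens that carry the space marker, assuming the SentencePiece-style "▁" convention. The token pieces and the `detokenize` helper are hypothetical.

```python
SPACE_MARKER = "\u2581"  # the "▁" character used to mark a preceding space

def detokenize(tokens: list[str]) -> str:
    """Join token strings and turn space markers back into real spaces."""
    text = "".join(tokens).replace(SPACE_MARKER, " ")
    # Drop the single leading space contributed by the first token's marker.
    return text[1:] if text.startswith(" ") else text

# Hypothetical token pieces for the text "hello world".
tokens = [SPACE_MARKER + "hello", SPACE_MARKER + "wor", "ld"]
print(detokenize(tokens))  # hello world
```

Tokens without the marker ("ld") attach directly to the previous piece, which is how the tokenizer distinguishes a word continuation from a new word.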

Real-world examples

Not theory — what real teams actually shipped using this technique.

OpenAI’s tiktoken: the tokenizer library used for GPT-3.5, GPT-4, and the embeddings API. It is freely available as a Python library. The tokenizer playground at platform.openai.com/tokenizer lets you paste text and see exactly how it is tokenised.
Llama 3’s expanded vocabulary: Meta increased the vocabulary from Llama 2’s 32,000 tokens to 128,000, specifically to improve multilingual tokenization efficiency and reduce the token count for non-English languages.
The “strawberry” problem: GPT models famously miscounted the letter ‘r’ in “strawberry.” Part of the reason is that the model sees a few subword tokens (for example “straw” + “berry”; the exact split depends on the tokenizer), not individual letters, which makes character-level operations genuinely harder than word-level ones.

Common pitfalls

Token boundary assumptions: never assume token boundaries align with word boundaries. “New York” may be two tokens or one. “don’t” may be “don” + “’t”. Character-level operations (counting letters, reversing strings) require special handling.
Tokenizer mismatch: using a different tokenizer than the one the model was trained with produces wrong token IDs and garbled outputs. Always use the official tokenizer for each model.
Prompt token counting for context limits: calculating how many tokens a prompt uses requires running the actual tokenizer. Word count estimates are inaccurate. Use tiktoken or the token-counting endpoint of the model’s API.
Vocabulary limitation: a character or string with no dedicated vocabulary entry (rare character, new emoji) is split into byte-level fallback tokens in tokenizers that support byte fallback. Unusual Unicode characters can produce unexpectedly long token sequences.
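For the token-counting pitfall, a minimal sketch using tiktoken might look like the following. It assumes tiktoken is installed and that "cl100k_base" is the right encoding for your model, which you should verify. The helper names and the context-window numbers are hypothetical.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helper below works without it
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def fits_in_context(prompt_tokens: int, context_limit: int,
                    reserved_for_output: int) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + reserved_for_output <= context_limit

# Made-up numbers for illustration.
print(fits_in_context(7_500, context_limit=8_192, reserved_for_output=500))  # True
print(fits_in_context(7_900, context_limit=8_192, reserved_for_output=500))  # False
```

In practice the result of `count_tokens(prompt)` would be passed as `prompt_tokens`. Reserving room for the output matters because generated tokens share the same context window as the prompt.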

Frequently asked questions

QUESTION 1 What is tokenization in simple terms?

ANSWER 1 Converting text into integers a model can process — and converting integers back to text after generation. The first and last step of every LLM pipeline.

QUESTION 2 What is Byte Pair Encoding?

ANSWER 2 An algorithm that iteratively merges the most frequent adjacent character pairs into tokens — producing common words as single tokens and rare words as recognisable sub-pieces.

QUESTION 3 Why does tokenization affect multilingual performance?

ANSWER 3 Tokenizers trained on English-heavy data assign fewer tokens per word in English than in other languages — non-English content uses more context window and costs more per API call.

QUESTION 4 What are common vocabulary sizes?

ANSWER 4 GPT-4: ~100,000. Llama 3: 128,000. GPT-3: 50,257. Larger vocabularies mean more single-token words, shorter sequences, and better multilingual coverage.

Sources & further reading

Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 — BPE tokenization paper.
Kudo & Richardson (2018). SentencePiece. arXiv:1808.06226 — language-agnostic tokenization.
OpenAI Tokenizer: platform.openai.com/tokenizer — interactive tokenization explorer.
tiktoken GitHub: github.com/openai/tiktoken — fast Python BPE tokenizer.
Hugging Face tokenizers: huggingface.co/docs/tokenizers — comprehensive tokenizer library documentation.
