Q: Why do language models use tokens instead of words?

Tokenising by word fails for rare words, technical terms, and multiple languages: a purely word-based vocabulary needs millions of entries to cover everything. Tokenising by character gives a tiny vocabulary but very long sequences (every letter is one token). Subword tokenisation (BPE, WordPiece) splits at a level that balances vocabulary size with sequence length: common words are single tokens, and rare words split into recognisable sub-pieces.

Token – UseCaseinAI

Q: What is a token in simple terms?

A token is a chunk of text, roughly a word or part of a word. Language models do not see characters or words directly. They see tokens. 'Hello world' might be 2 tokens. 'Antidisestablishmentarianism' might be 6 tokens. A rough rule: 100 tokens ≈ 75 English words ≈ 400 characters. Every interaction with an LLM, both your message and the response, is measured and priced in tokens.
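The 100-tokens-to-75-words rule of thumb can be turned into a quick estimator. This is a minimal sketch for back-of-envelope planning only; the function name `estimate_tokens` is made up for illustration, and real counts depend on the tokeniser and the text.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count.

    Uses the rule of thumb 100 tokens ~ 75 English words (4 tokens per 3 words),
    rounded up. This is a heuristic, not a real tokeniser.
    """
    words = len(text.split())
    return (words * 4 + 2) // 3  # integer ceiling of words * 4 / 3


if __name__ == "__main__":
    sample = "Language models do not see characters or words directly."
    print(estimate_tokens(sample))
```

For anything that feeds into billing or hard limits, count with the model's actual tokeniser rather than an estimate like this.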

Q: How does token count affect LLM pricing?

LLM APIs charge per token, typically with separate rates for input tokens (your prompt) and output tokens (the response). OpenAI's GPT-4 Turbo, for example, launched at $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens; prices change often, so check the provider's current price list. A 1,000-word essay is roughly 1,300 tokens. At scale, with millions of API calls per day, token efficiency directly determines infrastructure cost. Shorter, more precise prompts and responses reduce costs significantly.
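The per-token arithmetic is easy to sketch in code. The example below assumes separate input and output rates quoted per 1,000 tokens; the `Pricing` class and function names are hypothetical, and the rates are simply the illustrative figures from the paragraph above, not a current price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1k: float   # USD per 1,000 input tokens
    output_per_1k: float  # USD per 1,000 output tokens


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one API call, billing input and output separately."""
    return (input_tokens / 1000) * pricing.input_per_1k + (
        output_tokens / 1000
    ) * pricing.output_per_1k


def daily_cost(pricing: Pricing, calls_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """Cost in USD per day, assuming every call has the same token counts."""
    return calls_per_day * request_cost(pricing, input_tokens, output_tokens)


# Illustrative rates only (assumption): $0.01 in, $0.03 out per 1,000 tokens.
EXAMPLE_PRICING = Pricing(input_per_1k=0.01, output_per_1k=0.03)

if __name__ == "__main__":
    # A 1,300-token essay as input with a 500-token reply.
    print(f"${request_cost(EXAMPLE_PRICING, 1300, 500):.3f} per call")
    print(f"${daily_cost(EXAMPLE_PRICING, 100_000, 500, 200):,.2f} per day")
```

Keeping input and output rates separate matters because output tokens usually cost more, so a short prompt with a long response can be pricier than the reverse.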

Q: What is the relationship between tokens and the context window?

The context window is measured in tokens: it is the maximum number of tokens (input + output combined) a model can process in one interaction. Claude's 200,000-token context window can hold approximately 150,000 words, about two novels. Every token in the conversation history, system prompt, retrieved documents, and generated response counts against this limit. When the limit is reached, the request may be rejected, or the application must drop or summarise older content to make room.
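One common application-side way to stay within the window is to drop the oldest messages first. The sketch below assumes a caller-supplied `count_tokens` function (ideally the model's real tokeniser) and reserves part of the window for the response; the function and parameter names are hypothetical.

```python
from typing import Callable, List


def fit_history(
    messages: List[str],
    count_tokens: Callable[[str], int],
    context_limit: int,
    reserved_for_output: int,
) -> List[str]:
    """Return the most recent messages that fit in the input budget.

    The input budget is the context limit minus the tokens reserved for the
    model's response. Oldest messages are dropped first.
    """
    budget = context_limit - reserved_for_output
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest message
    return kept


if __name__ == "__main__":
    history = ["first question", "a long earlier answer " * 50, "latest question"]
    # Word count stands in for a real tokeniser here (assumption).
    print(fit_history(history, lambda m: len(m.split()), 100, 20))
```

A production system would usually keep the system prompt pinned and might summarise old turns instead of discarding them; this sketch only shows the budgeting logic.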

⚡ A token is the basic unit of text a language model processes — roughly a word or word fragment. 100 tokens ≈ 75 English words. Every LLM interaction is measured in tokens: context windows are token limits, API pricing is per token, and generation happens one token at a time. Understanding tokens is essential for building and budgeting any LLM application.

Category: NLP & Language · Difficulty: Beginner · Last updated: 15 May 2026 · 4 min read

Token — What It Is and Why Everything in LLMs Is Measured in This Single Unit

What is a token?

Language models do not read words. They do not read characters. They read tokens — subword units that sit between the two. A tokeniser splits raw text into this sequence of tokens before the model sees anything. The model generates responses one token at a time. Every pricing calculation, every context limit, and every speed benchmark is expressed in tokens.

Understanding tokens is the single most practically important concept for anyone building LLM applications — more than any architecture detail or training technique. It directly affects what you can send to the model (context window), what you pay (pricing per token), and how fast responses arrive (tokens per second).

How tokenisation works

The dominant approach is Byte Pair Encoding (BPE), a compression algorithm adapted for NLP. Starting with individual characters, BPE iteratively merges the most frequent adjacent pairs into new tokens. After enough merges, common words become single tokens (“hello” → 1 token), less common words split into recognisable pieces (“unhappiness” → “un” + “happiness” → 2 tokens, though the exact split depends on the tokeniser), and very rare words become sequences of character-level pieces.

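The result is a vocabulary of typically tens of thousands to a few hundred thousand tokens (GPT-2 uses about 50,000; newer tokenisers use 100,000 or more) that efficiently covers text across languages, code, and technical content.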
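The merge loop described above can be sketched in a few lines. This is a toy, character-level illustration in the spirit of Sennrich et al. (2016), not any production tokeniser: it ignores word-boundary markers, byte-level handling, and pre-tokenisation rules, and the function names are made up for this example.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges from a list of words."""
    # Each distinct word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged: Counter = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Tokenise one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    merges = train_bpe(corpus, num_merges=4)
    print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it encodes into two familiar pieces. That is the property that lets subword vocabularies handle rare and unseen words.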

Token counts by content type

Standard English prose: ~1 token per 0.75 words (~133 tokens per 100 words).
Code: often more tokens per character — symbols, indentation, and unusual strings tokenise less efficiently.
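Non-Latin scripts: often more tokens per word. Languages like Chinese or Arabic may use 2-4 tokens per word, compared with roughly 1.3 for English.
Numbers: digits are often split into separate tokens, one at a time or in small groups depending on the tokeniser, so “1,234,567” may be 7+ tokens.
Whitespace and punctuation: a punctuation mark is typically 1 token. A single space is usually merged into the token for the following word, while newlines and indentation add tokens of their own.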
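One reason non-Latin scripts tend to cost more can be seen before any tokeniser runs. Assuming a byte-level BPE tokeniser (the kind used by GPT-2-style tokenisers; the text above does not specify one), the starting units are UTF-8 bytes, and non-Latin characters take several bytes each. The helper name `utf8_profile` is hypothetical.

```python
def utf8_profile(text: str) -> dict:
    """Compare character count with UTF-8 byte count for a string."""
    data = text.encode("utf-8")
    chars = len(text)
    return {
        "characters": chars,
        "utf8_bytes": len(data),
        "bytes_per_char": len(data) / chars if chars else 0.0,
    }


if __name__ == "__main__":
    for sample in ["hello", "你好", "مرحبا"]:
        print(sample, utf8_profile(sample))
```

Byte counts are only an upper bound on token counts, since merges learned during training compress frequent byte sequences. Scripts that are rarer in the training data get fewer merges, so they stay closer to that bound.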

Real-world examples

Not theory — what real teams actually shipped using this technique.

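OpenAI API pricing: GPT-4o is billed per token, with separate input and output rates. A customer support system making 100,000 API calls per day with 500-token prompts and 200-token responses processes 70 million tokens per day. At a flat illustrative rate of $0.005 per 1,000 tokens, that costs approximately (100,000 × 700 tokens ÷ 1,000 × $0.005) = $350/day. Output tokens are usually priced higher than input tokens, so the real bill would be higher. Token count is the primary cost driver.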
Claude’s 200,000-token context: you can send Claude approximately 150,000 words of text in a single interaction, roughly the length of one or two full-length novels. Every word of system prompt, conversation history, and retrieved documents counts against this.
Whisper tokeniser for speech: Whisper processes audio in 30-second chunks, and for each chunk its decoder generates a sequence of text tokens autoregressively to produce the transcript.

Common pitfalls

Token counting errors in budgeting — developers often estimate token counts using word counts, leading to cost underestimates. Use the official tokeniser (tiktoken for OpenAI, the model’s tokeniser for others) for accurate counts before estimating costs.
Language disparity — prompts in Chinese or Arabic consume significantly more tokens than equivalent English content — an important consideration for multilingual applications.
Special tokens: beyond regular text, models use special tokens such as [BOS] (beginning of sequence), [EOS] (end of sequence), [PAD] (padding), and [MASK] (for masked modelling). These count against the context window but are handled automatically by the API.
Token != meaning unit: cutting text at a raw token count for chunking or truncation can split a word or sentence partway through, creating garbled text. Truncate at sentence or paragraph boundaries that fit within the token budget instead.
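Truncating at sentence boundaries under a token budget can be sketched as follows. The sentence splitter is a naive regular expression and the default counter is the rough 4-tokens-per-3-words heuristic; both are stand-ins chosen for illustration, and the function names are hypothetical. In practice, pass the model's real tokeniser as `count_tokens`.

```python
import re
from typing import Callable


def rough_token_count(text: str) -> int:
    """Heuristic stand-in for a tokeniser: about 4 tokens per 3 words."""
    return (len(text.split()) * 4 + 2) // 3


def truncate_at_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> str:
    """Keep whole leading sentences while the result fits in `max_tokens`."""
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if count_tokens(candidate) > max_tokens:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
    return " ".join(kept)


if __name__ == "__main__":
    doc = "Tokens are subword units. Budgets are set in tokens. Cut on sentences."
    print(truncate_at_sentences(doc, max_tokens=12))
```

If even the first sentence exceeds the budget, this sketch returns an empty string; a real system would need a fallback, such as cutting at a word boundary.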

Frequently asked questions

QUESTION 1 What is a token in simple terms?

ANSWER 1 A chunk of text — roughly a word or word fragment — that language models process. 100 tokens ≈ 75 English words. Every LLM interaction is measured and priced in tokens.

QUESTION 2 Why do LLMs use tokens instead of words?

ANSWER 2 Subword tokenisation (BPE) balances vocabulary size with sequence length — common words are single tokens, rare words split into recognisable pieces, handling all languages and technical content efficiently.

QUESTION 3 How does token count affect LLM pricing?

ANSWER 3 APIs charge per 1,000 tokens — separately for input and output. At millions of calls per day, token efficiency directly determines infrastructure cost.

QUESTION 4 What is the relationship between tokens and the context window?

ANSWER 4 The context window is the maximum token count (input + output) a model can process at once. Claude’s 200,000-token window holds ~150,000 words. All conversation history, prompts, and documents count against this limit.

Sources & further reading

Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 — BPE tokenisation for NLP.
Kudo & Richardson (2018). SentencePiece: A simple and language independent subword tokenizer. arXiv:1808.06226
OpenAI Tokenizer tool: platform.openai.com/tokenizer — interactive tokenisation visualiser.
tiktoken GitHub: github.com/openai/tiktoken — OpenAI’s fast BPE tokeniser library.
