Most people working with LLMs think about text.
They write prompts, measure characters, worry about word counts, and hope the model does something useful with what they send.
But the model does not see text. It sees tokens. And almost every cost, constraint, and performance characteristic of a language model is expressed in tokens, not words.
This matters more than most engineers realize. Your API bill is denominated in tokens. Your context window is measured in tokens. Your latency scales with tokens. Your retrieval chunks are sized in tokens. Even the fairness of the model across languages is, in large part, a tokenization problem.
This article is the conceptual companion to the Tokenizer tool on this site. If you are building anything non-trivial with LLMs, understanding tokenization is not optional. It is foundational.
What a Token Actually Is
A token is a sequence of characters that a language model treats as a single unit.
It is not a word. It is not a character. It is something in between, determined by a statistical algorithm that has looked at billions of text samples and decided which character sequences are common enough to deserve their own atomic symbol.
Most modern LLMs use a technique called Byte Pair Encoding (BPE) or variants of it. The algorithm starts with individual characters and repeatedly merges the most frequent adjacent pairs into new tokens, until it reaches a fixed vocabulary size, typically between 32,000 and 200,000 tokens.
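The merge step is easier to see in code than in prose. Below is a toy sketch in Python; it is deliberately simplified (real tokenizers operate on raw bytes, train on billions of samples, and have special handling for whitespace and punctuation), and the four-word corpus is purely illustrative.

```python
# Toy sketch of BPE-style merging: repeatedly fuse the most frequent
# adjacent pair of symbols into a new token. Illustrative only.
from collections import Counter

corpus = ["lower", "lowest", "newer", "newest"]
words = [list(w) for w in corpus]  # start from individual characters

def most_frequent_pair(words):
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    a, b = pair
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)  # the pair becomes a single new symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, words)
```

Run it and you will see "w" and "e" fuse first, because that pair appears in every word of the toy corpus. Repeat the process millions of times on real text and frequent fragments grow into full tokens.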
The result is a vocabulary where:
- Common English words get their own token: the, and, is, have
- Common prefixes and suffixes are their own tokens: un, ing, ed, tion
- Rare or unusual words get split into multiple tokens
- Numbers, punctuation, and whitespace each have their own rules
If you run the phrase "tokenization is fascinating" through a GPT-4 tokenizer, you get something like 4 tokens. If you run the phrase "antidisestablishmentarianism" through the same tokenizer, you also get multiple tokens, because that word is rare enough that the model learned to represent it piece by piece.
This has a concrete consequence: character count and token count are not interchangeable. You cannot estimate one from the other reliably. The only way to know how many tokens a piece of text contains is to actually tokenize it.
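The practical way to do that is with a tokenizer library. Here is a minimal sketch using OpenAI's open-source tiktoken package (assuming it is installed via pip install tiktoken; use the encoding that matches your target model):

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["tokenization is fascinating", "antidisestablishmentarianism"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{len(text)} characters -> {len(token_ids)} tokens: {pieces}")
```

The two strings are almost the same length in characters but not in tokens, which is exactly why estimating one from the other fails.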
Why Tokens Drive Your Bill
Every major LLM provider prices by the token: OpenAI, Anthropic, Google, Mistral, all of them.
The structure is always the same: one rate for input tokens (the prompt you send), one rate for output tokens (the response the model generates), with output tokens typically costing 3 to 5 times more than input tokens.
This has immediate design implications.
Prompt length is a cost lever. Every word in your system prompt, every retrieved document in your RAG context, every few-shot example, every piece of conversation history: all of it is billed on every single API call. A verbose system prompt that you write once and forget about can quietly dominate your API costs at scale.
Output length needs active management. The model will happily generate long, rambling responses if you let it. Setting max_tokens is not just a safety measure. It is a direct cost control. An application that averages 1,500 output tokens per response when it could average 400 is paying nearly 4x more for the same core value.
Caching is not a nice-to-have. Prompt caching mechanisms, now offered by most providers, let you pay a reduced rate for tokens you have already sent. For any application with a stable system prompt or repeated context, this is often the single biggest cost optimization available.
The practical mental model: tokens are a currency. Treat them the way you would treat any other metered resource. Measure consumption, identify waste, optimize the worst offenders.
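A back-of-the-envelope cost model makes the currency framing tangible. The per-million-token rates below are placeholders, not any provider's real prices; substitute your own:

```python
# Hypothetical rates in dollars per million tokens -- replace with your
# provider's current pricing.
INPUT_RATE_PER_M = 3.00
OUTPUT_RATE_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000

# Same 2,000-token prompt, two output lengths.
verbose = request_cost(2_000, 1_500)
concise = request_cost(2_000, 400)
savings_per_million_requests = (verbose - concise) * 1_000_000
print(f"verbose: ${verbose:.4f}  concise: ${concise:.4f}  "
      f"saved per 1M requests: ${savings_per_million_requests:,.0f}")
```

At these illustrative rates, trimming the average response from 1,500 to 400 output tokens saves five figures per million requests. The exact numbers will differ; the shape of the lever does not.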
Why Tokens Define Your Context Window
Every LLM has a context window: the maximum amount of text it can process in a single call. Modern models advertise context windows of 128k, 200k, even 1 million tokens.
These numbers are not word counts. They are token counts. And the difference matters.
For English text, the rough rule of thumb is that 1 token corresponds to approximately 0.75 words, or about 4 characters. So a 128k token context window fits approximately 96,000 words of English prose.
In practice, your effective context is always smaller than the advertised number, because:
- The system prompt takes tokens
- The conversation history takes tokens
- The retrieved context (in RAG systems) takes tokens
- The user's current query takes tokens
- The response you are asking the model to generate also consumes tokens from the same budget
And, as I discussed in the article on context window limitations, the quality of the model's output degrades well before the hard token limit. A 200k context window does not mean you can use 200k tokens effectively. You cannot.
Understanding tokens is the first step to understanding why context engineering matters. You are not managing text. You are managing a token budget, and every design decision trades tokens for something else.
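One way to internalize this is to write the budget down as numbers before writing any prompt text. The allocations below are illustrative assumptions, not recommendations:

```python
# Sketch of an explicit context budget. Every number here is an assumption;
# the point is that the allocation is written down and intentional.
CONTEXT_WINDOW = 128_000

budget = {
    "system_prompt": 1_200,
    "conversation_history": 6_000,
    "retrieved_context": 8_000,   # e.g. 16 chunks of ~500 tokens
    "user_query": 300,
    "reserved_for_output": 2_000,
}

used = sum(budget.values())
print(f"allocated: {used:,} tokens, headroom: {CONTEXT_WINDOW - used:,}")
# Headroom is not free space to fill: quality degrades well before the limit.
```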
Why Tokens Determine RAG Chunking
In a RAG system, you split documents into chunks and retrieve the most relevant ones at query time. The size of those chunks is almost always specified in tokens.
The reason is practical: chunks need to fit into the context window alongside the prompt, the system instructions, and the model's response. If your chunks are sized in characters or words, you cannot reason about the total context budget accurately.
But the deeper reason is semantic. Tokens are the unit the model actually processes. Chunking at arbitrary character boundaries can split a single token across two chunks, producing noise. Chunking with awareness of token boundaries produces cleaner, more coherent inputs.
If you are designing a retrieval pipeline:
- Specify chunk sizes in tokens, not characters
- Account for the tokenizer your embedding model uses, which may differ from the tokenizer of the generating model
- Budget carefully: if you retrieve 5 chunks of 500 tokens each, you have already spent 2,500 tokens of context before you write a single word of your prompt
This is one of the reasons RAG systems often feel surprisingly tight on context. The token math adds up faster than intuition suggests.
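Here is a minimal sketch of token-aware chunking using tiktoken. The chunk size, overlap, and cl100k_base encoding are assumptions; in a real pipeline you would use the tokenizer that matches your embedding model:

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 500, overlap: int = 50,
                    encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into chunks of roughly chunk_size tokens with some overlap."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Stand-in text; substitute your own documents.
document = "Your source document text goes here. " * 300
enc = tiktoken.get_encoding("cl100k_base")
for i, chunk in enumerate(chunk_by_tokens(document)):
    # Re-encoding can shift boundaries slightly, so counts are approximate.
    print(i, len(enc.encode(chunk)))
```

Sizing in tokens also makes the budget arithmetic trivial: the chunk size times the number of retrieved chunks is exactly the context you are spending on retrieval.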
The Multilingual Tax: Where Tokenization Gets Unfair
Here is where tokenization stops being a technical curiosity and becomes a structural problem.
Most modern LLM tokenizers are trained on datasets that are heavily biased toward English. The statistical merging process that creates the vocabulary has seen far more English text than any other language. As a result, English gets highly efficient encodings. Other languages, not so much.
The practical consequence: the same sentence, translated into different languages, produces dramatically different token counts.
Take a simple example. The English sentence "Artificial intelligence is transforming how we work" tokenizes to roughly 8 tokens with a GPT-4 tokenizer.
The Italian translation, "L'intelligenza artificiale sta trasformando il nostro modo di lavorare", tokenizes to approximately 15 tokens. Nearly twice as many.
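You can observe the gap yourself in a few lines. A sketch using tiktoken (exact counts vary slightly by tokenizer version):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Artificial intelligence is transforming how we work",
    "Italian": "L'intelligenza artificiale sta trasformando il nostro modo di lavorare",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    count = len(enc.encode(text))
    print(f"{language}: {count} tokens ({count / baseline:.1f}x English)")
```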
The same phenomenon applies, with varying severity, across most non-English languages, but the picture is more nuanced than a simple "English is cheap, everything else is expensive" rule:
- Italian and French typically cost 30 to 50 percent more tokens than equivalent English
- German tends to cost 40 to 60 percent more, partly because of long compound words
- Japanese, with its mix of kanji and kana, costs on average about 2x English and can reach 8x in pathological cases
- Arabic and other non-Latin scripts typically cost 2 to 3 times more
- Some low-resource languages (Burmese, Dzongkha, Shan, several African and South Asian languages) can cost 5 to 12 times more than English
Chinese is the interesting exception. With older tokenizers (like GPT-4's cl100k_base), Chinese was tokenized inefficiently: one character often mapped to two or even three tokens. With GPT-4o's newer o200k_base tokenizer, which has a 200,000-token vocabulary instead of 100,000, Chinese improved dramatically. Some Chinese phrases that used to take 12 tokens now take only 2. And because each Chinese character is semantically denser than an English word, the equivalent meaning in Chinese can actually use fewer total tokens than English for the same content.
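Recent versions of tiktoken ship both encodings, so the shift is easy to verify. The Chinese phrase below is an illustrative assumption of mine, not a benchmark:

```python
import tiktoken

# Roughly: "Artificial intelligence is changing the way we work."
text = "人工智能正在改变我们的工作方式"

# Compare the GPT-4-era encoding with the newer GPT-4o-era one.
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
```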
The takeaway is not "Chinese is always cheaper." It is that tokenizer design matters enormously, and recent tokenizers have shifted the economics for some languages more than others. If you are operating in Chinese, upgrading from an older tokenizer family to a newer one can cut your token costs significantly. If you are operating in most other non-English languages, the tax is still very much there.
This is not a theoretical concern. It is a real operational inequality that affects:
Cost. An Italian-language application pays 30 to 50 percent more per request than its English equivalent for the same business logic. Across millions of requests, the delta is substantial.
Context. A 128k context window fits far less Italian than it fits English. When you are stuffing retrieved documents into the context, the effective capacity is smaller in non-English languages.
Latency. More tokens mean more inference time. Non-English responses are not just more expensive to generate; they are also slower.
Quality. There is evidence that models perform slightly worse on tasks in languages they tokenize inefficiently, because the same amount of "reasoning budget" in tokens corresponds to less actual content.
If you are building for a non-English market, you need to account for this. Pricing models built on English benchmarks will underestimate your true per-request costs. Context budgets designed for English will feel cramped in your target language. Latency SLAs that assume English response times will miss.
The tokenizer tool on this site makes this observable in seconds. Paste the same paragraph in two languages and compare the token counts. The disparity is not a rounding error. It is a tax.
What Tokens Do Not Tell You
Tokens are a powerful unit of measurement, but they are not a proxy for meaning.
Two prompts with the same token count can have wildly different quality. Two responses with the same token count can provide vastly different value. Token counts say nothing about clarity, relevance, or correctness. They only say how much the model had to process.
This is worth remembering, because it is easy to slip into treating tokens as a proxy for everything. They are not. They are a resource, not a metric of output quality.
A shorter prompt is not necessarily a better prompt. A longer response is not necessarily more valuable. Optimize tokens as a cost lever, not as a quality signal.
How to Think About Tokens in Practice
If you are designing or operating an LLM-based system, a few principles are worth internalizing.
Measure, do not estimate. Never assume you know the token count of a piece of text. Use a tokenizer. The Tokenizer tool on this site is built for exactly this: paste your prompt, your retrieved context, your output samples, and see the real numbers.
Instrument your pipeline. Log the token counts of inputs and outputs in production. Track them over time. A slow drift in average prompt length is often the first sign of prompt bloat.
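The instrumentation can be as small as a wrapper around your model call. In the sketch below, call_model and the usage attribute names are hypothetical stand-ins; adapt them to your SDK, which almost certainly reports token usage in its responses:

```python
import logging
import time

logger = logging.getLogger("llm.usage")

def instrumented_call(call_model, prompt: str, **kwargs):
    """Wrap an LLM call and log token usage and latency."""
    start = time.monotonic()
    response = call_model(prompt, **kwargs)
    latency = time.monotonic() - start
    # Assumption: the response exposes usage counts; adjust the attribute
    # names to match your provider's SDK.
    logger.info(
        "input_tokens=%s output_tokens=%s latency_s=%.2f",
        response.usage.input_tokens, response.usage.output_tokens, latency,
    )
    return response
```

Aggregating these logs over time is what surfaces the slow drift in prompt length before it becomes a billing problem.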
Audit by language. If you serve multiple languages, measure the token cost per language separately. Do not assume your English unit economics generalize.
Budget per component. Break down your context budget explicitly: system prompt, instructions, retrieved context, conversation history, expected output. Make the allocations intentional.
Optimize the expensive paths. Identify the 10 percent of your prompts that consume the most tokens and focus optimization efforts there. Prompt engineering, retrieval tuning, and output length control compound when applied to high-volume paths.
The Takeaway
Tokens are the unit of everything in LLM systems: cost, context, latency, chunking, fairness.
Most teams building with LLMs develop an intuition for tokens slowly and painfully, usually after a billing surprise or a context overflow in production. The better path is to build that intuition deliberately, from the start.
Pasting a few paragraphs into a tokenizer and watching the count change is a five-minute exercise that reframes how you think about every subsequent design decision.
If you have not already, try the Tokenizer. Paste a prompt. Paste a translation. Paste a retrieved document. Watch the numbers.
The text you write and the text the model sees are not the same thing. Once you see the difference, you will design differently.
Building production AI systems? I write regularly about applied AI engineering, system architecture, and the real lessons from production deployments. Find me on LinkedIn or reach out directly at ciao@pavlo.sh.