Chunking Strategies for RAG: The Decision That Makes or Breaks Retrieval

Most RAG failures are blamed on the wrong thing.

When a system retrieves the wrong passage, the instinct is to swap the embedding model, add a reranker, or upgrade to a bigger LLM. Sometimes that helps. Often it doesn't, because the damage was done much earlier, at the moment the document was cut into pieces.

The chunk is the unit of retrieval. It is the smallest thing your system can find, score, and hand to the model. If the answer to a question is split across two chunks, each half may score too low to be retrieved, and an answer that exists in your corpus becomes invisible. No embedding model recovers information that chunking threw away.

This is why chunking is the decision that makes or breaks retrieval, and yet it's the one people spend the least time on. The canonical tutorial splits documents into 512-token blocks and moves on. That default works in a demo and quietly sabotages production.

This article is about doing it deliberately. It covers the real trade-off underneath every chunking choice, the sizes and overlaps that actually work, and the full ladder of strategies, from boring-but-effective recursive splitting to late chunking and contextual retrieval, so you can pick the one your documents actually need instead of the one the tutorial happened to use.

Why Chunking Exists at All

You chunk for two reasons, and it helps to keep them separate.

The first is mechanical. Embedding models have a maximum input length, and a single dense vector can only represent so much meaning before it turns to mush. Embed an entire 40-page document into one vector and you get an average of everything it says, useless for matching a specific question. Smaller units produce sharper, more discriminative embeddings.

The second is about the generation step. Even with today's large context windows, stuffing whole documents into the prompt is expensive and counterproductive. More tokens cost more money and, as covered in why more tokens hurt your AI app's performance, longer context can actively degrade accuracy through the "lost in the middle" effect, where models use material buried mid-prompt less reliably than material near the start or end. Retrieving tight, relevant chunks is how you keep the prompt short and the model focused.

So chunking serves retrieval precision and generation quality at the same time. The problem is that those two goals pull in opposite directions, and that tension is the heart of every decision you're about to make.

The Core Tension: Precision vs. Context

Here is the trade-off in one sentence:

Small chunks retrieve precisely but lose context. Large chunks preserve context but retrieve imprecisely.

A small chunk, say 128 tokens, is a tight semantic target. When a user asks a narrow factual question, a small chunk that contains exactly that fact will score very high, because the embedding isn't diluted by surrounding material. But that same small chunk may be useless to the model, because it's been severed from the sentence that defined the term it uses, or the condition that qualifies the claim it makes.

A large chunk, say 1,024 tokens, carries its own context. The model gets the surrounding explanation, the caveats, the definitions. But the embedding now represents a dozen different ideas at once, so it matches many queries weakly instead of one query strongly. Retrieval precision drops, and you start surfacing chunks that are topically adjacent but don't actually answer the question.

Every chunking strategy is, at bottom, an attempt to cheat this trade-off: to get the retrieval precision of small chunks and the contextual completeness of large ones at the same time. The naive strategies pick a point on the line. The good strategies bend the line.

Chunk Size and Overlap: The Numbers That Actually Work

Before reaching for a clever strategy, get the boring parameters right. They matter more than most people expect.

Chunk size. For most general-purpose retrieval, 256–512 tokens is the sweet spot. It's small enough to stay precise for fact-style questions and large enough to carry a coherent thought. As a rough heuristic:

Content type	Suggested chunk size	Suggested overlap
Fact-dense / Q&A / policies	256–512 tokens	10–15%
General prose / articles	300–500 tokens	10–20%
Narrative / long-form reasoning	500–1,000 tokens	15–20%
Code and highly technical text	100–200 tokens	15–25%
Conversational / chat transcripts	split by turn or topic	minimal

There's a second axis hiding here: query type. Fact-based queries favor smaller chunks (256–512), while analytical queries that require synthesizing across material favor larger ones (512–1,024). If you know your traffic skews one way, bias your defaults accordingly.

Overlap. Overlap means each chunk repeats a slice of the previous one, typically 10–20% of the chunk size, so roughly 50–100 tokens on a 512-token chunk. The point is to stop a sentence or idea that lands on a boundary from being orphaned: if a key fact sits at the seam between chunk N and chunk N+1, overlap ensures at least one chunk contains it whole. Too little overlap and you sever ideas at boundaries; too much and you bloat the index with duplicated content that dilutes retrieval signal and inflates cost.
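To make the overlap arithmetic concrete, here is a minimal sketch of a sliding window over a token list. The name window_chunks and the placeholder tokens are illustrative assumptions, not any library's API; in a real pipeline the tokens would come from the tokenizer your embedding model uses.

```python
def window_chunks(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens stand in for real tokenizer output in this sketch.
tokens = [f"t{i}" for i in range(1200)]
chunks = window_chunks(tokens, size=512, overlap=64)
print([len(c) for c in chunks])
```

With 1,200 tokens, a 512-token window, and a 64-token overlap (12.5%), this yields chunks of 512, 512, and 304 tokens, and the last 64 tokens of each chunk reappear at the start of the next. A fact that straddles a seam survives whole in one chunk only if it is shorter than the overlap, which is one reason to tune the overlap rather than take it on faith.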

Count in tokens, not characters. "512" only means something if you're measuring the unit the model actually consumes. Splitting on character counts produces chunks that vary wildly in real size across languages and content types, a point that follows directly from how tokenization works and why it behaves so strangely across languages. Use a token-accurate splitter.

Get these three right and you've already beaten most production RAG systems. Now let's climb the ladder of strategies.

The Chunking Ladder

Like most architecture decisions in applied AI, chunking is best approached as a ladder: start with the cheapest strategy that could work, and only climb when your retrieval metrics prove you need to. Here are the rungs, from simplest to most sophisticated.

1. Fixed-Size Chunking

Split every N tokens. No awareness of structure, sentences, or meaning. Just cut at the counter.

It's the fastest and simplest approach, and it's genuinely fine for prototyping. But it slices through sentences, tables, and ideas with equal indifference, producing chunks that begin mid-thought and end mid-word. Use it to get a pipeline running, then replace it. It should not survive into production.

2. Recursive Chunking (The Default You Should Start With)

Recursive character splitting is the strategy most production systems should begin with, and many never need to leave.

Instead of blindly cutting at N tokens, it tries to split at the most natural boundary available and only falls back to a coarser one when a chunk is still too big. The hierarchy is typically: paragraphs first (double newline), then single newlines, then sentences, then words. It respects the document's structure as far as the size limit allows, then breaks the rule only as a last resort.

The result is chunks that mostly start and end on real boundaries, at a computational cost that's indistinguishable from fixed-size. In head-to-head benchmarks it's repeatedly the strongest cheap option, landing near the top on end-to-end accuracy while being fast and trivial to run. Combined with a sensible 512-token size and 10–20% overlap, recursive splitting is the baseline every other strategy has to beat.
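The fallback hierarchy can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation any particular library ships: recursive_split is a hypothetical name, size is counted in whitespace-separated words as a stand-in for real tokens, and overlap is left out to keep the recursion visible.

```python
def recursive_split(text, max_tokens=512, separators=("\n\n", "\n", ". ", " "),
                    count=lambda s: len(s.split())):
    """Split on the coarsest separator first; go finer only for oversized pieces."""
    if count(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: hard-cut on whitespace words as a last resort.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if count(candidate) <= max_tokens:
            buffer = candidate  # keep merging neighbours while they fit
            continue
        if buffer.strip():
            chunks.append(buffer)
        if count(piece) <= max_tokens:
            buffer = piece
        else:
            # This piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer, count))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks


doc = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda mu.\n\nNu xi."
for chunk in recursive_split(doc, max_tokens=6):
    print(repr(chunk))
```

The first and last paragraphs fit the six-word budget and come out intact. Only the oversized middle paragraph is pushed down the hierarchy, where it finds no newline or sentence break and is finally cut at a word boundary. Two simplifications to keep in mind: a piece cut at a separator loses that separator, and leftovers from a recursive call are not merged with their neighbours, both of which production splitters typically handle.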

3. Semantic Chunking

Semantic chunking decides boundaries by meaning rather than length. It embeds consecutive sentences and measures the similarity between them; when the similarity drops sharply, signaling a topic shift, it places a boundary there. Chunks end where the subject changes, not where a token counter ticks over.

This produces highly coherent chunks and, on the right documents, the biggest accuracy lift of any pure splitting method. The catch is cost: semantic chunking has to run an embedding model across the whole corpus just to decide where to cut, which can be an order of magnitude slower and more expensive than recursive splitting. It shines on dense, topic-shifting material (research papers, legal documents, technical manuals) where coherence genuinely drives retrieval quality. On clean, well-structured prose, recursive splitting often gets you most of the benefit for a fraction of the price.
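Here is a minimal sketch of the boundary rule. Everything in it is an assumption made for illustration: toy_embed is a bag-of-words stand-in for a real embedding model, and a fixed similarity threshold replaces the "sharp drop" detection that real implementations often base on percentiles or moving averages.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: treat as a topic shift
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "Refunds are issued within 30 days.",
    "Refunds are issued to the original payment method.",
    "Shipping takes five business days.",
    "Shipping takes longer on large orders.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The two refund sentences land in one chunk and the two shipping sentences in another, because the only similarity that falls below the threshold is the one between the second and third sentences. The cost the text describes is visible in the first line of semantic_chunks: every sentence in the corpus gets embedded before a single boundary is chosen.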

4. Document-Aware (Structure-Aware) Chunking

Prose and structured content should not be chunked the same way. A Markdown export has headings. Code has functions. A spec has tables. A fixed or recursive splitter will happily cut a table in half mid-row or split a function from its signature, and the resulting chunks are nonsense.

Document-aware chunking uses the document's own structure as the boundary system: split on Markdown headers, keep code blocks and functions intact, treat each table row-group as a unit, preserve list items. This connects directly to a failure mode covered in RAG in production: the moment you treat a contract, a codebase, and a product manual identically is the moment retrieval quality starts to break. Match the splitter to the format.
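For Markdown, the idea can be sketched as a splitter that cuts at headings but never inside a fenced code block. The function name split_markdown is hypothetical, and the sketch assumes simple ATX headings and backtick fences only; tables, nested lists, and other fence styles would need their own rules.

```python
import re

FENCE = "`" * 3  # built this way so the sample stays readable inside a code block


def split_markdown(text):
    """Split Markdown into sections at headings, never inside a fenced code block."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence delimiter
        is_heading = not in_fence and re.match(r"#{1,6} ", line)
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


doc = "\n".join([
    "# Install",
    "Run the installer.",
    FENCE + "bash",
    "# this comment is not a heading",
    "pip install example",
    FENCE,
    "## Configure",
    "Edit the config file.",
])
sections = split_markdown(doc)
print(len(sections))
```

The shell comment inside the fence starts with a hash, which a naive heading splitter would treat as a new section. Tracking fence state keeps the code block attached to the section that explains it, so the document splits into two sections rather than three.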

5. Hierarchical / Parent-Document Retrieval

This is the first strategy that genuinely cheats the precision-vs-context trade-off, and it's the one I reach for most often on complex documents.

You maintain two representations of the same content. Small child chunks (precise, used for retrieval) are linked to larger parent chunks (contextual, used for generation). At query time you search over the small chunks for a precise match, but once a child chunk hits, you pass its parent to the model. The small chunk surfaces the answer; the large chunk supplies the surrounding context the model needs to use it correctly.

You get the retrieval precision of small chunks and the contextual completeness of large ones, the whole point of the exercise. The cost is index complexity: you're storing and maintaining two layers and the links between them. For dense technical documents where the matching detail and the explanatory context live a paragraph apart, it's consistently worth it.
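A toy version of the two-layer index shows the mechanics. The names build_index, overlap_score, and retrieve_parents are illustrative, and the word-overlap score is a deliberate stand-in for embedding similarity; only the child-to-parent bookkeeping is the point.

```python
def build_index(parents, child_size=8):
    """Cut each parent into small child chunks, remembering where each came from."""
    children = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            children.append((parent_id, " ".join(words[i:i + child_size])))
    return children


def overlap_score(query, text):
    """Toy relevance score: shared lowercase words (a real system would use embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, k=2, score=overlap_score):
    """Rank the children, then return the distinct parents of the top-k hits."""
    ranked = sorted(children, key=lambda c: score(query, c[1]), reverse=True)
    seen, result = set(), []
    for parent_id, _ in ranked[:k]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


parents = [
    "The rate limit is 100 requests per minute. "
    "It applies per API key and resets on the minute boundary.",
    "Invoices are generated monthly. "
    "They are emailed to the billing contact on the first business day.",
]
children = build_index(parents)
print(retrieve_parents("what is the rate limit", parents, children, k=2))
```

The query matches the small child that states the limit, but the model receives the whole parent, including the sentence about per-key scope that the child alone would have dropped. Deduplicating by parent matters: here both top hits belong to the same parent, and passing it twice would only waste prompt tokens.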

6. Late Chunking

Late chunking inverts the usual order of operations, and the idea is genuinely clever.

In normal ("early") chunking you split the document first, then embed each chunk independently. The problem: each chunk is embedded in total isolation. A chunk that says "It must not exceed this threshold" has no idea what "it" or "this threshold" refers to. The antecedents were in the previous chunk, and the embedding model never saw them.

Late chunking flips this. It runs the entire document (or a long passage) through a long-context embedding model first, producing token-level embeddings that have already attended to the full document. Then it splits, pooling the token embeddings within each chunk's boundaries into a chunk embedding. Because every token was computed with the whole document in view, each chunk's vector carries the long-range context (resolved pronouns, references, thematic continuity) that independent embedding throws away.

The elegant part: unlike the next strategy, late chunking needs no extra LLM and no rewriting of your text. It's a change to how you embed, not what you store. The requirement is a long-context embedding model, which is increasingly standard, plus access to its token-level outputs rather than only a single pooled vector. When your documents are full of references that span boundaries, this is one of the highest-leverage upgrades available.
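The pooling step is small enough to sketch directly. This assumes you already have per-token embeddings from a single pass of a long-context model over the whole document, plus chunk boundaries expressed as token spans; late_chunk is a hypothetical name, the vectors below are made-up numbers, and mean pooling is assumed as the pooling rule.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool document-level token embeddings inside each [start, end) span."""
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append(
            [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Made-up 2-dimensional token embeddings for a five-token "document".
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 5)]  # chunk boundaries chosen after embedding
print(late_chunk(token_embeddings, spans))
```

The difference from ordinary chunking is entirely in where token_embeddings comes from. Because each token vector was computed with the full document in view, the pooled chunk vector inherits that context, and the stored chunk text stays exactly as it was.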

7. Contextual Retrieval

Contextual retrieval, introduced by Anthropic, attacks the same lost-context problem as late chunking but from the opposite direction: instead of fixing the embedding, it fixes the text before embedding.

For each chunk, a fast LLM is given the whole document and the target chunk, and asked to write a short situating preamble, typically 50–100 tokens explaining what this chunk is and where it sits in the document ("This section describes the refund window within the Q3 returns policy…"). That generated context is prepended to the chunk before it's embedded and before it's indexed for keyword (BM25) search. The chunk now carries its own context into both retrieval channels.

The numbers are the reason people pay attention. In Anthropic's evaluation, measured as the failure rate of the top-20 retrieved chunks:

Technique	Top-20 retrieval failure rate	Reduction
Baseline (standard embeddings)	5.7%	0%
+ Contextual embeddings	3.7%	35%
+ Contextual embeddings & contextual BM25	2.9%	49%
+ Reranking on top	1.9%	67%
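The Reduction column is relative to the 5.7% baseline, not a difference in percentage points. A quick check of the arithmetic, with relative_reduction as an illustrative helper name:

```python
def relative_reduction(baseline, rate):
    """Share of baseline failures eliminated, as a fraction between 0 and 1."""
    return (baseline - rate) / baseline


baseline = 5.7  # top-20 retrieval failure rate in percent, from the table
for label, rate in [("contextual embeddings", 3.7),
                    ("plus BM25", 2.9),
                    ("plus reranking", 1.9)]:
    print(f"{label}: {relative_reduction(baseline, rate):.0%}")
```

So the full stack removes about two thirds of the baseline's retrieval failures, while the absolute failure rate moves by 3.8 percentage points.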

The obvious objection is cost: you're running an LLM call over every chunk in your corpus. The thing that makes it practical is prompt caching: you cache the full document once and reuse it across all of its chunks, which drops the price to roughly a dollar per million document tokens. It's a one-time indexing cost, not a per-query one. The trade-off is an extra processing stage in your ingestion pipeline and a dependency on a generation model at index time, but for high-value corpora the retrieval gain is hard to argue with.
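The indexing step can be sketched as follows. The prompt wording is a paraphrase written for illustration, not Anthropic's published prompt, and generate is a hypothetical callable wrapping whatever LLM client you use; a canned stub stands in for it here so the example runs without network access.

```python
PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document "
    "to improve search retrieval. Answer with the context only."
)


def contextualize(document, chunks, generate):
    """Prepend an LLM-written situating preamble to each chunk before indexing."""
    enriched = []
    for chunk in chunks:
        context = generate(PROMPT.format(document=document, chunk=chunk)).strip()
        enriched.append(f"{context}\n\n{chunk}")
    return enriched


def fake_generate(prompt):
    # Stand-in for a real LLM call; always returns the same canned preamble.
    return "This section describes the refund window within the Q3 returns policy."


document = "Q3 returns policy. Refunds are available. It must be requested within 30 days."
chunks = ["It must be requested within 30 days."]
enriched = contextualize(document, chunks, fake_generate)
print(enriched[0])
```

The enriched string is what gets embedded and what gets indexed for BM25, so both retrieval channels see the preamble. Note that the document portion of the prompt is identical for every chunk of the same document, which is exactly the prefix that prompt caching lets you pay for once.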

8. Agentic Chunking

The frontier rung: hand the document to an LLM and let it decide the boundaries itself, reasoning about where ideas begin and end the way a human editor would. It can produce excellent, semantically clean chunks. It's also the slowest and most expensive option by a wide margin, and rarely justified unless document structure is so irregular that nothing else works. Know it exists; reach for it last.

How to Actually Choose

The ladder is not a checklist to climb to the top. It's a sequence to walk only as far as your documents force you. Here's the decision path I'd actually follow.

Start at recursive. Use recursive splitting at 512 tokens with 10–20% overlap, counted in tokens, and a structure-aware splitter for any non-prose formats (code, Markdown, tables). This is the floor, and for a surprising number of corpora it's also the ceiling.

Climb based on the symptom, not the hype. Each higher rung fixes a specific failure, so diagnose before you build:

Symptom	Likely fix
Tables, code, or lists get mangled	Document-aware chunking
The match is found but the model lacks surrounding context	Hierarchical / parent-document retrieval
Chunks full of unresolved "it / this / the above" references	Late chunking
Right topic retrieved, wrong specific chunk; ambiguous short chunks	Contextual retrieval
Topic-shifting dense documents, coherence is poor	Semantic chunking

Match the strategy to the document type, not the other way around. A legal contract, a Python repository, a product manual, and a Slack archive each want a different approach: chunk size, boundary logic, and overlap all shift with the material. There is no universal chunker, and the search for one is where a lot of teams quietly lose months.

Critically, these strategies compose. Document-aware splitting feeds hierarchical retrieval. Contextual retrieval stacks with reranking. Late chunking pairs with parent-document retrieval. The question is rarely "which one." It's "which combination clears the bar my metrics set."

You Can't Choose Without Measuring

Everything above is a hypothesis until you put numbers on it. Chunking is the most measurable decision in your RAG stack, which makes guessing at it inexcusable.

Build a small retrieval evaluation set: 50–100 representative queries, each labeled with the chunk(s) that actually contain the answer. Then measure how often your chunking-plus-retrieval setup surfaces the right chunk in the top K (recall@K), along with mean reciprocal rank (MRR) to capture how high it ranks. When you change chunk size, swap recursive for semantic, or add contextual retrieval, re-run the same set and watch the numbers move. A jump from 0.71 to 0.86 recall@5 is the difference between a system users distrust and one they rely on, and it's often a chunking change that produces it.
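Both metrics fit in a few lines. This sketch follows the definition used above, where a query counts as a hit if any labeled chunk appears in the top K; the function names and chunk IDs are illustrative, and the three-query dataset is made up to keep the arithmetic easy to follow.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one labeled chunk in the top k results."""
    hits = sum(1 for got, want in zip(retrieved, relevant)
               if set(got[:k]) & set(want))
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the first labeled chunk (0 when none is retrieved)."""
    total = 0.0
    for got, want in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in want:
                total += 1.0 / rank
                break
    return total / len(relevant)


# One ranked result list and one set of labeled chunk IDs per query.
retrieved = [["c3", "c1", "c9"], ["c4", "c5", "c6"], ["c7", "c2", "c8"]]
relevant = [{"c1"}, {"c0"}, {"c7"}]

print(recall_at_k(retrieved, relevant, k=1))
print(recall_at_k(retrieved, relevant, k=3))
print(mean_reciprocal_rank(retrieved, relevant))
```

On this toy set, recall@1 is one in three, recall@3 is two in three, and MRR is 0.5. Reading the two together is what makes them useful: the gap between recall@1 and recall@3 says the right chunk is often retrieved but ranked too low, which points at ranking, while the query that misses entirely points back at chunking or the embedding.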

This is the same discipline laid out in why your demo works but production fails: without an eval harness, you are tuning chunking by vibes, and vibes don't catch the regression you'll ship next week. Decide chunking with a dataset, not an opinion.

Do You Even Need to Chunk?

A fair question to ask before any of this: if context windows are huge now, why not skip retrieval and drop the whole document into the prompt?

Sometimes you should. If your entire knowledge base comfortably fits in context and stays there cheaply, chunking and retrieval add complexity you may not need. But that's a narrow case. Retrieval, and therefore chunking, comes back the moment your corpus outgrows a context window, or you're serving many queries where paying for the full document every time is wasteful, or accuracy starts sliding because the model is drowning in irrelevant tokens. Long context and RAG are not rivals; they're tools for different scales. This is part of the broader RAG vs. fine-tuning vs. prompting decision, and chunking only enters the picture once you've decided retrieval is the right layer to be working at.

A Practical Checklist

Before you call your chunking strategy production-ready:

[ ] Chunks are measured in tokens, not characters
[ ] Recursive (structure-respecting) splitting is the baseline, not fixed-size
[ ] Chunk size matches your dominant query type (256–512 for facts, larger for synthesis)
[ ] Overlap set to 10–20%, tuned against duplication cost
[ ] Code, tables, and Markdown handled by a structure-aware splitter, not cut mid-unit
[ ] Parent-document retrieval in place where matches and context live apart
[ ] Late chunking or contextual retrieval evaluated if chunks lose cross-boundary references
[ ] A labeled query set exists, and recall@K is measured on every chunking change
[ ] Re-chunking is treated as a routine operation, not a one-time setup

The Right Mental Model

Chunking feels like preprocessing: a thing you do once, early, before the "real" RAG work begins. That framing is exactly why it's the most common silent failure in production systems.

The truth is the opposite. Chunking decides the ceiling of everything downstream. Your embedding model can only match what your chunks expose. Your reranker can only reorder what retrieval surfaced. Your LLM can only reason over what it was handed. Every component after chunking is bounded by the quality of the units chunking produced, and none of them can recover information that bad boundaries destroyed.

So treat it as a first-class design decision, measured and revisited like any other. Pick the strategy your documents actually need, prove it with recall@K, and climb the ladder only when the metrics tell you to. Do that, and you've fixed the part of the RAG stack that most teams never realize was broken.

The model is rarely why your retrieval fails. More often, it's where you drew the lines.

Building production RAG systems? I write regularly about applied AI engineering, retrieval, and the real lessons from production deployments. Find me on LinkedIn or reach out directly at ciao@pavlo.sh.