Everyone's building RAG systems right now.
The pitch is clean: take your documents, embed them into a vector database, retrieve the relevant chunks at query time, and hand them to an LLM. Better answers, grounded in your actual data.
A weekend project becomes a proof of concept. The demo works beautifully.
Then you try to deploy it to production.
And that's when you discover that RAG is not a solved problem. It's a set of engineering decisions, each with non-obvious trade-offs, and most tutorials quietly skip the hard parts.
This article covers what those hard parts actually are, and what you need to know before your system has to serve real users at real scale.
The Demo-to-Production Gap
The canonical RAG tutorial looks like this:
- Load a PDF
- Split it into 512-token chunks
- Embed with text-embedding-ada-002
- Store in a vector database
- At query time, retrieve the top 5 chunks
- Stuff them into a prompt
- Get a great answer
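In code, the whole canonical pipeline fits on one page. The sketch below is a toy version: a hash-based stand-in for the embedding API (deterministic, but with no real semantic meaning) and word-count chunking in place of a tokenizer.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real embedding API: hash each word into a fixed-size
    # vector and L2-normalize. Illustrative only -- no semantics.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size chunking by word count -- the tutorial default.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)),
                  reverse=True)[:k]

document = ("The refund policy allows returns within 30 days. " * 3
            + "Shipping is free for orders over 50 dollars. " * 3)
chunks = chunk(document, size=8)
top = retrieve("what is the refund policy", chunks, k=2)
```

Swap toy_embed for a real embedding call and chunks for a vector store, and this is essentially every tutorial. Everything that follows is about where this shape breaks.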
This works in controlled conditions. The document is clean. The query matches the document language closely. There's one document, maybe five.
In production:
- Documents are noisy (scanned PDFs, HTML exports, tables, headers, footers)
- Queries use different vocabulary than the documents
- There are hundreds or thousands of documents with overlapping content
- Users ask questions that span multiple documents
- Some queries have no relevant answer in the corpus, but the system answers anyway
The architecture is the same. The failure modes are completely different.
Failure Mode 1: Chunking Is Not a Free Parameter
Most implementations treat chunking as a default setting: 512 tokens, fixed overlap, done. This is one of the most common sources of retrieval failure.
The core problem: a chunk is the unit of retrieval. If the relevant information spans two chunks, neither chunk individually will score high enough to be retrieved. The answer exists in your corpus. Your system cannot find it.
Consider a technical specification where a crucial constraint is defined on page 3 and referenced in a table on page 7. A fixed-size chunker will cut across both, making the semantic connection invisible to the retrieval system.
What actually works
Semantic chunking splits on meaningful boundaries, such as paragraphs, sections, or sentence clusters, rather than token counts. It produces chunks with higher internal coherence and fewer semantic splits across boundaries.
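A minimal sketch of the idea, assuming paragraphs are separated by blank lines and using word counts as a rough proxy for tokens:

```python
def semantic_chunks(text: str, max_words: int = 120) -> list[str]:
    # Split on blank-line paragraph boundaries, then greedily merge
    # consecutive paragraphs until the word budget is reached. A paragraph
    # is never cut in half, so chunks keep internal coherence.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Production splitters add more boundary types (headings, sentences) and a real tokenizer, but the principle is the same: split where meaning splits, not where the counter runs out.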
Hierarchical chunking maintains two representations: small chunks for precise retrieval, and larger parent chunks that are passed to the model once a match is found. The small chunk surfaces the match; the large chunk provides the context. This is sometimes called parent-document retrieval and consistently outperforms flat chunking on complex documents.
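Parent-document retrieval needs little more than a mapping from each child chunk back to its parent section. A sketch, with query_scorer standing in for whatever relevance function your retriever provides:

```python
def build_parent_index(parents: list[str], child_size: int = 20):
    # Each small child chunk remembers the index of its parent section.
    children: list[str] = []
    parent_of: list[int] = []
    for pi, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            children.append(" ".join(words[i:i + child_size]))
            parent_of.append(pi)
    return children, parent_of

def retrieve_parent(query_scorer, children, parent_of, parents) -> str:
    # Score small chunks for precise matching, but hand the model the
    # full parent section for context.
    best = max(range(len(children)), key=lambda i: query_scorer(children[i]))
    return parents[parent_of[best]]
```

The small chunk wins the similarity search; the parent is what reaches the prompt.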
Document-aware chunking handles structured content differently from prose. Tables, code blocks, and lists should be kept intact whenever possible, not split mid-row or mid-function.
The right chunking strategy depends on your document type. A legal contract, a codebase, a product manual, and a collection of Slack messages each require a different approach. Treating them identically is the first place retrieval quality breaks down.
Failure Mode 2: Embedding Models Are Not Interchangeable
The embedding model is the core of your retrieval system. It determines what "similar" means.
General-purpose embedding models are trained on broad internet text. They work reasonably well for general queries. They often fail on content from specialized domains, such as medical, legal, financial, or highly technical material, because the semantic relationships in those domains differ from what the model learned.
The practical consequence: a query about "acceleration clauses in loan agreements" may retrieve chunks about "car acceleration" before it retrieves the relevant legal clause, because the embedding space hasn't learned that "acceleration" in a financial context means something entirely different.
What to watch for
Run a retrieval evaluation before you go to production. Take 50–100 representative queries from your target use case, identify the expected relevant chunks manually, and measure how often your retrieval system finds them in the top-K results. This is called recall at K, and it's the metric that actually matters.
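Recall@K is straightforward to compute once you have ground truth. A sketch using the hit-rate variant (did at least one known-relevant chunk appear in the top K for each query?):

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 5) -> float:
    # retrieved: query -> ranked list of chunk ids from your system.
    # relevant:  query -> set of chunk ids a human judged relevant.
    # Hit-rate variant: a query counts as a hit if any relevant chunk
    # appears in its top-k results. (Other definitions average the
    # fraction of relevant chunks found; pick one and be consistent.)
    hits = 0
    for query, rel in relevant.items():
        if set(retrieved.get(query, [])[:k]) & rel:
            hits += 1
    return hits / len(relevant)
```

Run it once per candidate embedding model on the same query set and the comparison becomes a number instead of a feeling.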
A general embedding model might score 0.65 recall@5 on your domain-specific corpus. A fine-tuned or domain-appropriate model might score 0.85. That difference is not theoretical. It is the difference between a system that mostly works and one that users trust.
If you're working in a specialized domain, evaluate domain-specific embedding models. For code, models trained on code repositories significantly outperform general models. For biomedical text, models trained on PubMed or clinical notes will outperform text-embedding-ada-002.
Failure Mode 3: Retrieval Quality Degrades Over Time
Your system works well at launch. Three months later, users are complaining. Nothing changed in the code.
What changed: the document corpus grew.
This is embedding drift at the corpus level, and it's one of the most underappreciated maintenance problems in production RAG systems.
As you add documents:
- More chunks compete for the top-K slots
- The average relevance of top-K results decreases
- Documents from early in the corpus may become harder to retrieve if they're underrepresented compared to newer content
- Duplicate or near-duplicate content accumulates, diluting retrieval signal
What to do about it
Re-evaluate regularly. Run your retrieval benchmarks on a schedule. If recall@5 drops from 0.82 to 0.71 over two months, you need to know before users notice.
Implement deduplication. Before indexing, check for near-duplicate chunks using cosine similarity thresholds. Storing five near-identical chunks about the same policy update helps no one.
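A minimal dedup pass, assuming embeddings are already L2-normalized so the dot product equals cosine similarity. The quadratic scan is fine for small batches; at scale you would use an ANN index for the neighbor lookups:

```python
def deduplicate(embeddings: list[list[float]],
                threshold: float = 0.95) -> list[int]:
    # Return indices of chunks to keep. A chunk is dropped if its
    # embedding is nearly identical (cosine >= threshold) to one
    # that has already been kept.
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(sum(a * b for a, b in zip(emb, embeddings[j])) < threshold
               for j in kept):
            kept.append(i)
    return kept
```

The threshold is corpus-dependent: 0.95 is a common starting point, but validate it against actual near-duplicates in your data.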
Segment your corpus deliberately. Not all documents should compete in the same retrieval pool. If users are asking operational questions, don't let marketing copy and archived memos compete with your runbooks and product documentation. Namespace or filter by document type where possible.
Monitor document freshness. Stale documents that contradict newer ones will get retrieved. A retrieval system with no concept of document recency will confidently return outdated information. Build metadata filtering into your retrieval layer from the start.
Failure Mode 4: The Reranking Gap
Vector similarity is a good first approximation of relevance. It is not the same as relevance.
Cosine similarity measures geometric proximity in embedding space. It doesn't understand:
- Which chunk is more specific to the query
- Whether a chunk actually answers the question or just mentions related terms
- The relative authority or quality of different sources
The practical result is that the top retrieved chunk is often not the most relevant chunk for answering the user's question. It's simply the one that happened to be geometrically closest.
The solution: two-stage retrieval
The most reliable production RAG architectures use a two-stage pipeline:
Stage 1 (recall): Retrieve a larger set of candidates, typically 20 to 50 chunks, using fast vector search. Optimize for recall here and cast a wide net.
Stage 2 (precision): Pass the candidates to a cross-encoder reranker, which scores each candidate against the query as a pair. Rerankers are more computationally expensive but significantly more accurate, because they can evaluate the chunk and query together rather than independently.
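The two-stage shape is simple to express. In the sketch below, vector_score and rerank_score are stand-ins for your bi-encoder similarity and cross-encoder model respectively:

```python
def two_stage_retrieve(query: str, chunks: list[str],
                       vector_score, rerank_score,
                       recall_k: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: cheap vector scores over the full corpus -- wide net.
    candidates = sorted(chunks, key=lambda c: vector_score(query, c),
                        reverse=True)[:recall_k]
    # Stage 2: expensive pairwise (query, chunk) scores, but only on
    # the small candidate set.
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final_k]
```

The structure is what matters: the expensive scorer only ever sees recall_k candidates, so its cost is bounded regardless of corpus size.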
Rerankers like Cohere's reranking endpoint, or open-source models like BGE-Reranker or ms-marco cross-encoders, consistently improve end-to-end answer quality by 15–30% in controlled evaluations.
The trade-off is latency and cost. A two-stage retrieval adds 100 to 300 ms and an additional API call or model inference step. For most production use cases, this is worth it. For latency-sensitive applications, you can apply reranking selectively, for example only when the top-1 vector score is below a confidence threshold.
Failure Mode 5: The Context Assembly Problem
You have the right chunks. The LLM still gives the wrong answer.
This happens more often than it should, and it connects directly to the context window degradation problem described in a previous article on this blog.
How you assemble your retrieved chunks into the prompt matters as much as which chunks you retrieve.
Ordering: Models exhibit primacy and recency bias. The most important chunks should appear at the beginning or end of the context block, never buried in the middle of a long list of retrieved documents.
Redundancy: If five retrieved chunks all say essentially the same thing, the model doesn't benefit from the repetition, but it does pay the cost in tokens and attention dilution. Deduplicate retrieved content before passing it to the model.
Source metadata: Always include source attribution in your context. Tell the model where each chunk came from. This enables the model to reason about conflicts between sources and improves citation accuracy in the output.
Negative space: When no relevant document is found, your system should say so. A RAG system that answers confidently when retrieval fails is worse than one that admits uncertainty. Build explicit low-confidence pathways into your architecture.
Failure Mode 6: Evaluation Is an Afterthought
Most RAG systems are evaluated informally: a few test queries, a judgment call on whether the answers look reasonable, a thumbs up.
This approach scales to demos. It doesn't scale to production.
The problem is systematic: informal evaluation doesn't catch the long tail of failure modes, doesn't detect regressions when you change chunking strategy or swap embedding models, and doesn't give you the data to make principled architecture decisions.
What a production evaluation pipeline looks like
Retrieval evaluation measures whether the right chunks are being surfaced. Use recall@K and mean reciprocal rank (MRR). You need a ground truth dataset of queries and their expected relevant chunks. This can be built manually or generated synthetically using an LLM and then validated.
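MRR rewards putting a relevant chunk near the top of the list, not merely somewhere in the top K. A sketch, with retrieved mapping each query to its ranked list of chunk ids and relevant to the ground-truth set:

```python
def mean_reciprocal_rank(retrieved: dict[str, list[str]],
                         relevant: dict[str, set[str]]) -> float:
    # For each query: 1/rank of the first relevant chunk in the ranked
    # list (0 if none appears), averaged over all queries.
    total = 0.0
    for query, rel in relevant.items():
        for rank, chunk_id in enumerate(retrieved.get(query, []), start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(relevant)
```

Track recall@K and MRR together: recall tells you whether the answer is reachable, MRR tells you how hard the model has to dig for it.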
Answer evaluation measures whether the final answer is correct and grounded in the retrieved context. The most practical approach is LLM-as-a-judge: use a capable model to rate each answer for correctness, faithfulness (does the answer reflect the context?), and relevance (does it address the query?). This is not a perfect evaluation method, but it scales and correlates reasonably well with human judgment.
Hallucination detection specifically checks whether the model's answer contains claims not supported by the retrieved context. This is the most important safety property of a RAG system: if the model is going to say something, it should be grounded.
Build this pipeline before you go to production, not after your first major incident.
The Latency Budget
RAG adds latency. How much depends on your architecture, but a naive implementation can easily add 500–1500ms to every query.
The breakdown typically looks like this:
| Step | Typical Latency |
|---|---|
| Query embedding | 20–80ms |
| Vector search | 10–50ms |
| Reranking (optional) | 100–300ms |
| Context assembly | < 10ms |
| LLM generation | 500–3000ms |
The LLM generation dominates. But the retrieval stack still matters, especially if you're serving many concurrent users.
Caching query embeddings is low-hanging fruit. Many production queries are semantically similar or identical. If the same query (or a near-duplicate) has been seen before, return the cached retrieval result.
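A sketch of the caching idea: normalize the query into a canonical cache key, then memoize the embedding call. embed_api here is a stand-in that counts invocations so the effect is visible; a real system would call its embedding provider there.

```python
import functools

CALLS = {"n": 0}

def embed_api(text: str) -> list[float]:
    # Stand-in for a real embedding call; counts invocations so the
    # caching effect is observable.
    CALLS["n"] += 1
    return [float(len(text))]

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivial variants share a cache key.
    return " ".join(query.lower().split())

@functools.lru_cache(maxsize=10_000)
def cached_embed(key: str) -> tuple[float, ...]:
    # Tuples are hashable and immutable, which lru_cache requires.
    return tuple(embed_api(key))

def embed_query(query: str) -> tuple[float, ...]:
    return cached_embed(normalize(query))
```

Exact-match caching on a normalized key is the cheap version; semantic caching (matching near-duplicate queries by embedding similarity) catches more but adds its own retrieval step.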
Asynchronous retrieval can parallelize embedding and search with other lightweight preprocessing steps, reducing perceived latency.
Approximate nearest neighbor (ANN) search is faster than exact search and generally sufficient for retrieval tasks. Most production vector databases use ANN by default, but verify the configuration. Exact search settings in development environments will not match production performance.
What Actually Works: A Production Checklist
Before you consider a RAG system production-ready:
Chunking:
- [ ] Chunking strategy matches document type (semantic boundaries, not fixed tokens)
- [ ] Parent-document retrieval implemented for complex documents
- [ ] Tables and code blocks are kept intact
Embeddings:
- [ ] Embedding model evaluated for your specific domain
- [ ] Recall@5 measured on representative query set (target: above 0.80)
Retrieval:
- [ ] Metadata filtering implemented (by date, document type, source)
- [ ] Deduplication in place before indexing
- [ ] Two-stage retrieval with reranking for quality-sensitive use cases
Context assembly:
- [ ] Most important chunks placed at beginning or end of context
- [ ] Source metadata included in context block
- [ ] Low-confidence fallback pathway when retrieval fails
Evaluation:
- [ ] Retrieval benchmark with ground truth queries
- [ ] LLM-as-a-judge pipeline for answer quality
- [ ] Hallucination detection layer
- [ ] Regression tests that run on every architecture change
Monitoring:
- [ ] Retrieval quality metrics tracked over time
- [ ] Corpus growth monitored (deduplication rate, document count)
- [ ] Latency per stage instrumented separately
The Right Mental Model
RAG is not a feature you add. It's an architecture you maintain.
The retrieval pipeline is a first-class component of your system, as important as the LLM itself, and in many cases more important, because the model cannot compensate for poor retrieval. Garbage in, garbage out still applies.
The teams that deploy reliable RAG systems treat retrieval quality as a product metric: they track it, alert on regressions, and invest in improving it continuously.
The teams that struggle treat RAG as infrastructure, set it up once, forget about it, and then wonder why the system degrades.
Most RAG demos look the same. Most production RAG systems perform very differently.
The difference is almost never the model. It's almost always the retrieval stack.
Building production AI systems? I write regularly about applied AI engineering, system architecture, and the real lessons from production deployments. Find me on LinkedIn or reach out directly at ciao@pavlo.sh.