At some point, every team running an LLM in production opens the monthly bill and asks the same question: why are we paying to process the same thing over and over again?
The system prompt does not change between requests. The RAG grounding documents do not change between the third and fourth question in a conversation. The tool definitions you send on every single agent turn are identical, byte for byte, to the ones you sent a second ago. And yet, by default, the model reprocesses all of it, every time, from zero.
Caching exists to fix exactly this waste. But "caching" in LLM engineering actually refers to two very different techniques that happen to share a name. Prompt caching is an infrastructure-level optimization: it reuses computation the model already did on an identical prefix. Semantic caching is an application-level shortcut: it reuses a past answer because the current question is similar enough, even if the wording is different.
One saves you money on tokens you were always going to send. The other tries to avoid calling the model at all. Confusing the two, or worse, implementing one while believing you built the other, is how teams end up with either underwhelming savings or a cache that quietly serves the wrong answer to a customer. This article covers both properly: how each one works, the real numbers from the major providers, and the failure modes that matter before you ship either of them.
Two Different Things Called "Cache"
Before going deeper, the distinction is worth stating plainly, because it resolves most of the confusion on its own.
Prompt caching works underneath the model. When you send a prompt, the model has to convert every token into an internal representation before it can generate anything, a step called prefill. If the beginning of your prompt is identical to a prompt you sent a few minutes ago, the provider can skip recomputing that part and reuse the internal state it already produced. You still get a fresh, fully generated response. You just do not pay full price, or pay full latency, for the part of the input that has not changed.
Semantic caching works above the model, in your application. It stores past question-and-answer pairs, and when a new question arrives, it checks whether a semantically similar question has already been answered. If the similarity crosses a threshold, it returns the stored answer directly and never calls the model at all. No prefill, no generation, no tokens billed on either side.
The practical difference matters a lot: prompt caching reduces the cost of input tokens on a call you are still making. Semantic caching, when it hits, removes the call entirely, saving both input and output tokens. That makes semantic caching look more powerful on paper, and it is, but it also introduces a risk that prompt caching does not have: prompt caching can never return a wrong answer, because the model still generates the response. Semantic caching can, because it is betting that "similar enough" means "the same answer applies."
Keep that asymmetry in mind. It explains almost every design decision in the rest of this article.
How Prompt Caching Actually Works
Every major provider now offers some version of prompt caching, and the mechanism is the same across all of them even if the pricing and controls differ.
When the model processes your prompt, it builds up an internal representation of every token it has seen, token by token, in order. This is expensive, it is the bulk of the compute cost on a typical request. If a later request starts with the exact same sequence of tokens, the provider can store that internal state from the first call and reuse it, skipping the recomputation for the shared portion. Only the new part of the prompt, the part that was not in the cached prefix, needs to go through full processing.
This has one direct consequence that shapes how you should write your prompts: caching works on prefixes, not on arbitrary matches. If your system prompt is identical across requests but you insert a timestamp or a random user ID before it, the "prefix" is no longer identical, and the cache misses. Static content has to come first. Variable content has to come last.
[ system instructions ] <- identical every request, put it here
[ tool definitions ] <- identical every request
[ retrieved documents ] <- changes occasionally
[ conversation history ] <- grows over time
[ user's current message ] <- different every request, always last
Get that ordering backwards, for example by putting today's date or a per-request identifier before your instructions, and you will pay full price on every call while believing you are cached.
What Each Provider Actually Offers
The three major providers implement prompt caching differently enough that the specifics matter if you are optimizing costs seriously.
| Provider | How it activates | Cache read discount | Cache write cost | Typical TTL |
|---|---|---|---|---|
| Anthropic (Claude) | Manual, via cache breakpoints in the request | ~90% off input price | ~1.25x input price (5 min), ~2x (1 hour) | 5 minutes by default, 1 hour optional |
| OpenAI (GPT) | Fully automatic above 1,024 tokens | ~50% off input price | Free (no separate write cost) | 5 to 10 minutes of inactivity, longer under load |
| Google (Gemini) | Automatic (implicit) or manual (explicit) | Up to ~90% off input price | Free for implicit, storage billed hourly for explicit | Session-based (implicit) or configurable (explicit) |
A few things worth internalizing from that table.
OpenAI made this a non-decision. Above roughly 1,024 tokens, caching kicks in automatically, there is no write penalty, and you get a straight 50% discount on the reused portion. There is no scenario where enabling it costs you anything, so the only real work on your side is structuring prompts so the static part actually stays identical across calls.
Anthropic gives you more control and a bigger discount, at the cost of doing the accounting yourself. You explicitly mark up to four "breakpoints" in your request, and the system reuses whatever it finds cached up to and including that breakpoint. The trade-off is a write premium: the first time a given prefix gets cached, you pay more than the base rate for it, not less. That premium pays for itself the moment the same prefix gets reused twice, which for a live system with a stable system prompt happens within seconds. The one-hour TTL option exists specifically for lower-traffic endpoints where five minutes between requests is not a safe assumption.
Gemini splits the difference. Implicit caching is automatic and free to set up, similar in spirit to OpenAI's approach. Explicit caching gives you Anthropic-style control, including the ability to pre-load a large, expensive context (think a full codebase or a long reference document) and keep it warm for hours, but you pay an hourly storage fee for the privilege, so it only makes sense when the reuse volume justifies it.
None of these numbers should be treated as fixed forever. Providers adjust pricing and thresholds regularly. The structure, a discount on reads and a break-even calculation on writes, is the part that is stable enough to design around.
Where Prompt Caching Pays Off the Most
Prompt caching is not equally valuable everywhere. It concentrates its benefit on a specific pattern: large, stable content that gets resent on every call. A few cases stand out.
Long system prompts. Once an application matures, its system prompt tends to grow. Role definition, tone guidelines, output format rules, safety instructions, a handful of few-shot examples. That block can easily run into the thousands of tokens, and it is identical on every single request. This is the single easiest win available, and it is why the token-cost discipline I wrote about separately pays off twice here: once from trimming the prompt, and again from caching what remains.
Agentic tool loops. This is the case people underestimate the most. An agent that calls tools in a loop resends the entire conversation history, including every previous tool call and every previous tool result, on every single turn, because the model has no memory between calls. A ten-step agent trajectory means the tenth call carries the full weight of the previous nine. Without caching, that cost grows roughly with the square of the number of steps. With caching, each turn only pays full price for the new increment, and everything before it is a 90 percent (or 50 percent) discount. If you are running long agent loops and have not enabled caching, this is very likely your single biggest optimization opportunity, ahead of model choice.
RAG systems with reused context. When a conversation stays grounded in the same retrieved documents across several follow-up questions, whether because the user is asking multiple things about the same report or because your retrieval step returns a stable top result, caching the retrieved context avoids re-billing it on every turn. This compounds with the retrieval-quality work covered in RAG in production: a well-tuned retrieval step that returns consistent context is also a cheaper one to run, because it is more cacheable.
Multi-turn conversations in general. Every additional turn in a chat carries the full history forward. Caching turns what would be a linearly growing cost per message into something closer to flat, because only the newest message pays full price.
Where prompt caching does not help: single-shot requests with no repetition, prompts that are mostly unique user content, or anything below the minimum token threshold (roughly 1,024 tokens on most providers, sometimes higher depending on the model). If your prompts are short and every one is different, there is nothing here for you to reuse.
How Semantic Caching Works
Semantic caching operates on a completely different premise: instead of matching exact token prefixes, it matches meaning.
The flow looks like this. When a request comes in, before calling the LLM at all, the application embeds the incoming query into a vector, using the same kind of embedding model discussed in the guide to choosing embeddings and vector databases. It then searches a vector store of previously answered queries for the closest match. If the similarity score clears a defined threshold, say 0.95 on a cosine similarity scale, the cached answer is returned immediately, with no call to the model at all. If nothing clears the threshold, the request goes to the LLM as normal, and the new question-and-answer pair gets embedded and stored for next time.
A reasonable production flow layers three checks in order of cost, cheapest first:
- Exact match on the literal query string, essentially free and instant.
- Semantic match via vector similarity search, a few milliseconds of overhead.
- Cache miss, fall through to the LLM, then write the new pair back into the cache.
The appeal is obvious for any application with repetitive query patterns: a customer support bot answering the same handful of questions phrased a hundred different ways, an internal tool answering the same class of question from different employees, a documentation assistant fielding "how do I reset my password" and "I forgot my password, help" as if they were the same question, because to a good embedding model, they are.
In practice, teams build this with a vector store that supports fast approximate nearest neighbor search, Redis being a common choice given how many teams already run it for other purposes, alongside purpose-built libraries like GPTCache, or through a caching layer bundled into an LLM gateway product like Portkey. The mechanics are simple enough that a basic version is a weekend project. Getting the threshold and the scoping right, which is where the actual risk lives, is not.
Why Semantic Caching Is Riskier Than It Looks
This is the part of the article that matters more than the setup instructions, because the failure modes of semantic caching are silent. A wrong answer served from cache looks exactly like a right answer. Nothing crashes. Nothing logs an error. It just quietly tells a user something incorrect with full confidence, and it will keep doing it for everyone who asks a similar-enough question until someone notices.
False positives from similarity thresholds. Two questions can be extremely close in embedding space and require completely different answers. "Can I cancel my subscription" and "can I cancel my order" sit close together semantically but demand different responses. Push the threshold too low to get a decent hit rate, and you start returning answers to the wrong question. There is a rough rule of thumb worth knowing: once your false positive rate creeps past 3 to 5 percent, the problem is usually not the threshold, it is the embedding model itself not being precise enough for your domain, and no amount of threshold tuning fixes that.
Staleness. A cached answer was correct when it was generated. It is not guaranteed to still be correct now. Prices change, policies change, inventory changes, documentation gets updated. A semantic cache with no invalidation strategy will happily serve last week's discontinued-product answer to this week's customer. This needs the same discipline as any other cache: a TTL appropriate to how fast your underlying facts move, and ideally a way to invalidate entries tied to the specific data that changed, rather than waiting for a blanket expiry.
Contamination in multi-turn conversations. Caching gets meaningfully harder once you are not caching single questions but full conversation turns. Two unrelated conversations can produce highly similar embeddings purely because they share a long, generic opening exchange, and a semantic cache built naively on the whole history can match them and return an answer meant for a completely different context. If you cache in a multi-turn setting, scope the similarity check narrowly, ideally to the current turn plus a short, relevant slice of history, not the entire transcript.
Cache poisoning and adversarial inputs. A semantic cache that only checks similarity, with no other safeguard, is a target. A carefully worded query, close enough in embedding space to a previously cached but incorrect or manipulated response, can trigger that response deliberately. This is structurally the same class of problem as the trust issue described in the piece on prompt injection: once you let an external input steer what your system treats as ground truth, someone will eventually try to steer it somewhere you did not intend. A single bad or malicious entry in the cache does not just cause one wrong answer, it gets served to everyone who asks anything close to it, turning one mistake into many.
Cross-user data leakage. This is the one that turns an inconvenience into an incident. If a cached response contains anything specific to one user, an account balance, an order status, a personalized recommendation, and the cache is not scoped by user or tenant, a semantically similar question from a different user can retrieve someone else's private answer. Any semantic cache serving a multi-tenant application needs the user or tenant identity baked into the cache key, not just the query text. Never cache a response that contains personal or account-specific data unless the cache lookup is scoped to that exact identity.
None of this means semantic caching is a bad idea. It means it is not a drop-in optimization you enable and forget. It is closer to a second, lightweight retrieval system sitting in front of your model, and it deserves the same evaluation discipline you would apply to retrieval quality, covered in more depth in why your LLM demo works but production doesn't. Treat cache hit correctness as a metric you monitor, not an assumption you make once at launch.
Combining Both: A Layered Cache Stack
Prompt caching and semantic caching are not competing choices. They sit at different layers and solve different problems, and the strongest production setups run both at once.
A layered request path looks like this:
- Exact-match cache. Identical query, identical answer, instant return. Cheapest possible check.
- Semantic cache. Similar-enough query, return a stored answer if it clears the similarity and scoping checks. Skips the model call entirely.
- Prompt cache. On an actual miss that reaches the model, the static parts of the prompt (system instructions, tool definitions, grounding documents, prior conversation turns) are served from the provider's prompt cache instead of being reprocessed from scratch.
- Full generation. Only the genuinely new part of the request, and the response itself, costs full price.
Each layer catches what the one before it could not, and each layer is cheaper to check than the one after it. Teams that instrument hit rates separately for each layer are the ones who actually capture the savings, because "we added caching" without measurement tends to mean one layer is doing all the work while another sits at a near-zero hit rate, quietly adding latency for nothing.
A Practical Decision Framework
Before implementing either technique, a short set of questions:
Reach for prompt caching when:
- [ ] Your system prompt, tool definitions, or grounding context are large (over roughly 1,000 tokens) and repeat across calls
- [ ] You run multi-turn conversations or agentic tool loops where history grows every turn
- [ ] You need a correctness-neutral optimization, since the model still generates every response
- [ ] You are already on a provider that supports it, which by now is effectively all of them
Reach for semantic caching when:
- [ ] Your traffic has genuine repetition in meaning, not just in exact wording (FAQ-style support, common documentation questions)
- [ ] You can tolerate, and can monitor for, an occasional near-duplicate answer being reused
- [ ] Responses do not contain user-specific or rapidly changing information, or you can scope and invalidate correctly if they do
- [ ] You are willing to treat the similarity threshold as a tuned parameter you monitor, not a default you set once
Be cautious with semantic caching when:
- [ ] Answers involve pricing, medical, legal, financial, or safety-relevant content, where a near-miss is not an acceptable outcome
- [ ] The application is multi-tenant and responses can contain any account-specific data
- [ ] Underlying facts change frequently and you do not have a real invalidation strategy, not just a short TTL
Common Mistakes
- Enabling caching and never checking the hit rate. Both providers and semantic cache libraries expose the numbers you need (
cache_read_input_tokens, hit and miss counts). If you are not looking at them, you do not actually know whether caching is doing anything. - Putting variable content before static content. A timestamp, a request ID, or a "current date" string placed before the system prompt breaks the prefix match and silently disables prompt caching, while everything still appears to work correctly.
- Treating a semantic cache hit as equivalent to a fresh model response. It is a bet, not a guarantee. Log cache hits separately in your evaluation pipeline so you can measure their accuracy independently from live generations.
- Setting the similarity threshold once and forgetting it. The right threshold depends on your embedding model and your domain's tolerance for near-misses. It needs periodic review, especially after you change embedding models.
- Caching personalized responses without scoping the key. This is the mistake that turns a cost optimization into a privacy incident. Always include user or tenant identity in the cache key when the response could contain anything specific to that user.
- Assuming a 1-hour or long TTL is always better. A longer TTL increases your hit rate but also increases the odds of serving a stale answer. Match the TTL to how fast the underlying facts actually change, not to whatever maximizes cache hits.
Frequently Asked Questions
What is the difference between prompt caching and semantic caching?
Prompt caching reuses the internal computation for an identical prompt prefix, reducing the cost and latency of a request the model still fully processes and answers. Semantic caching reuses a previously generated answer when a new query is similar enough in meaning, skipping the model call entirely. Prompt caching can never return an incorrect answer, since generation still happens. Semantic caching can, since it is returning a stored answer to a question it judged to be close enough, not identical.
Does prompt caching reduce output token costs?
No. Prompt caching only reduces the cost of input tokens, specifically the portion of the prompt that matches a previously cached prefix. Output tokens, the response the model generates, are always billed at full price regardless of caching. Semantic caching is the technique that can reduce output costs too, because on a cache hit it avoids generation entirely.
Is prompt caching automatic or do I need to configure it?
It depends on the provider. OpenAI applies it automatically above a token threshold, with no code changes required. Anthropic requires you to mark explicit cache breakpoints in your request. Gemini offers both an automatic (implicit) mode and a manual (explicit) mode with more control and an hourly storage cost. Check your provider's current documentation, since defaults and thresholds change over time.
Is semantic caching safe to use in production?
It can be, but it requires the same rigor as any other retrieval system: a carefully tuned similarity threshold, monitoring for false positive hit rates, scoping cache keys by user or tenant when responses contain personal data, and an invalidation strategy for facts that change. It is not a safe default for content where a near-miss answer is unacceptable, such as pricing, medical, legal, or account-specific information, unless those safeguards are explicitly in place.
How much can caching actually save?
Prompt caching alone typically cuts the input-token cost of the cached portion by 50 to 90 percent, depending on the provider, on any request that reuses a stable prefix. Semantic caching can go further because a hit skips the model call entirely, saving both input and output tokens, with reported savings in the range of 30 to 70 percent of total inference cost for applications with genuinely repetitive query patterns. Layering both together, along with a simple exact-match cache, produces the largest gains, since each layer catches what the previous one missed.
Should I build semantic caching myself or use an existing tool?
A basic version, embed the query, search a vector index, return on a threshold match, is simple to build with any vector database. What is hard to get right on your own is the tuning: threshold calibration, per-tenant scoping, staleness handling, and monitoring for false positives. Existing libraries like GPTCache, Redis-based semantic caching layers, or caching features built into LLM gateway products handle much of that scaffolding, and are a reasonable starting point unless your requirements are unusual enough to need a custom design.
The Takeaway
Prompt caching and semantic caching both save money, but they are not the same lever, and confusing them leads to either disappointing savings or a cache that occasionally hands out the wrong answer with total confidence.
Prompt caching is close to a free win: reorder your prompt so static content comes first, enable it on whichever provider you use, and watch the input cost on repeated prefixes drop, particularly if you are running agent loops or long conversations. There is very little downside and very little judgment required.
Semantic caching is a genuinely powerful cost lever, but it is also a system you have to design and monitor, not a checkbox. It trades a small, ongoing risk of returning a near-miss answer for a real reduction in both cost and latency, and that trade is only worth making once you have scoped it correctly: the right threshold, the right invalidation strategy, and cache keys that respect user boundaries.
Used together, layered from the cheapest check to the most expensive, they compound. Used carelessly, one of them will eventually cost you more than the API bill it was supposed to shrink.
Building production AI systems? I write regularly about applied AI engineering, system architecture, and the real lessons from production deployments. Find me on LinkedIn or reach out directly at ciao@pavlo.sh.