LLM Evaluation: Why Your Demo Works but Production Fails

Almost every LLM application follows the same arc.

You build a prototype in an afternoon. You try ten or fifteen prompts. The answers look great. You show a demo to your team, everyone is impressed, and the project gets a green light.

Then it ships, real users arrive, and the complaints start. Answers that are subtly wrong. Confident hallucinations. Responses that ignore half the question. Behavior that changes the moment you tweak a prompt or upgrade the model.

Nothing in your code broke. What broke is the assumption that "the answers look great" is a measurement.

It is not. It is a vibe.

This article is about the discipline that separates LLM demos from LLM products: evaluation. How to measure whether your system actually works, how to keep it working as you change it, and how to catch problems before your users do.

Why the Demo-to-Production Gap Exists

Traditional software is, for the most part, deterministic. The same input produces the same output. You write a test, it passes or it fails, and a green test suite means the behavior is locked in.

LLM applications break all three of those guarantees.

Ask the same model the same question twice and you can get different phrasings, different structure, sometimes different conclusions. There is rarely a single correct answer, so "passing" is a matter of degree, not a boolean. And the input space is effectively infinite, because users phrase things in ways you never imagined, so no finite test suite locks the behavior in.

The result is a gap that surprises almost every team:

Your handful of demo prompts are clean, well-formed, and representative of nothing.
Real users send typos, half-sentences, multi-part questions, and queries that have no good answer at all.
A prompt change that fixes one case silently breaks three others you forgot to check.

The failure is almost never the model itself. It is the absence of an evaluation harness that would have told you the system was fragile before you found out the expensive way. This is the same lesson that shows up in RAG in production and in the way longer context windows quietly degrade accuracy: the architecture looks fine, and the behavior is what fails.

What "Evaluation" Actually Means

LLM evaluation is not a single thing. It splits into two distinct activities that solve different problems, and you need both.

Offline evaluation runs against a fixed dataset of inputs you control. You run it before you ship, on every change, and it answers one question: did this change make the system better or worse? Offline evals are your regression suite. They are fast, repeatable, and cheap enough to run on every commit.

Online evaluation runs against live production traffic. It samples real requests and scores them in flight. It answers a different question: is the system still behaving well right now, with the queries users are actually sending? Online evals catch drift, abuse, cost spikes, and the long tail of inputs your offline set never imagined.

Teams that only do offline evaluation ship confidently and then get blindsided by production. Teams that only do online evaluation find out about regressions after their users do. You want offline evals to gate changes and online evals to watch reality.

Step 1: Build an Eval Dataset Before Anything Else

You cannot measure what you have not written down. The foundation of every serious LLM evaluation setup is a dataset of inputs paired with some notion of a good outcome. People call this a golden set, an eval set, or a test set. The name does not matter. Having one does.

A useful eval dataset has three properties.

It is representative. The inputs should mirror what real users actually send, not what you wish they sent. The fastest way to build a representative set is to mine real traffic. Pull logs, cluster the queries, and sample across the clusters so that common cases and rare cases are both present.

It covers the failure modes, not just the happy path. Half the value of an eval set is in the awkward cases: ambiguous questions, out-of-scope requests, adversarial phrasing, queries with no correct answer, and inputs in the wrong language. If your set only contains questions you already know the system answers well, your score will be high and meaningless.

It is labeled with intent. For each input, capture what a good response looks like. Sometimes that is an exact answer. More often it is a rubric: the response must cite a source, must not invent figures, must stay under a length limit, must refuse politely when out of scope.
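One possible way to make these three properties concrete is a small record type for a labeled case plus a stratified sampler over traffic that has already been clustered. This is a sketch under assumptions: the `EvalCase` fields, the cluster labels, and `sample_across_clusters` are illustrative names, and the clustering step itself (embeddings, manual tagging, or anything else) is assumed to have happened upstream.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalCase:
    """One labeled eval example (field names are illustrative)."""
    id: str
    input: str
    cluster: str                      # e.g. "billing", "out_of_scope"
    reference: Optional[str] = None   # exact answer, when one exists
    rubric: List[str] = field(default_factory=list)  # otherwise: criteria


def sample_across_clusters(logged, per_cluster, seed=0):
    """Take up to `per_cluster` queries from every cluster so that rare
    clusters are not drowned out by the common ones.

    `logged` is an iterable of (query, cluster_label) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_cluster = defaultdict(list)
    for query, cluster in logged:
        by_cluster[cluster].append(query)

    cases = []
    for cluster in sorted(by_cluster):
        queries = by_cluster[cluster]
        picked = rng.sample(queries, min(per_cluster, len(queries)))
        for i, query in enumerate(picked):
            cases.append(EvalCase(id=f"{cluster}-{i}", input=query, cluster=cluster))
    return cases


# Toy traffic: one dominant cluster and two rare ones.
logged = [(f"how do I reset my password #{n}", "account") for n in range(50)]
logged += [("wats teh refund polcy", "billing"), ("write me a poem", "out_of_scope")]
cases = sample_across_clusters(logged, per_cluster=3)
```

Sampling a fixed number per cluster, rather than proportionally, is a deliberate choice here: it over-represents rare queries, which is usually what you want from a set meant to expose failures. The sampled cases still need a reference or rubric written by a person.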

A common question is how big the set needs to be. You do not need thousands of examples to start. A focused set of fifty to a few hundred well-chosen cases is usually enough to catch most regressions. What matters far more than raw size is coverage of the cases that actually break. Grow the set over time by feeding every production failure back into it. That feedback loop, where failing traces become permanent test cases, is the single highest-leverage habit in applied LLM engineering.

Step 2: Choose Metrics That Reflect the Real Task

Generic metrics are seductive because they are easy to compute and easy to ignore. The metrics that matter are the ones tied to what your application is actually for.

That said, a few categories show up across almost every LLM application:

Dimension	Question it answers
Correctness	Is the answer factually right for this input?
Faithfulness	Is every claim grounded in the provided context, with nothing invented?
Relevance	Does the response actually address what was asked?
Completeness	Does it cover the full question, not just the easy part?
Format	Does it follow the required structure, length, and tone?
Safety	Does it refuse out-of-scope, unsafe, or unanswerable requests?

For retrieval-augmented systems, faithfulness is the metric that protects you from your scariest failure mode: a fluent, confident answer that is simply not supported by the documents you retrieved. For a structured-output feature, format validity may be the only metric that matters, and you can check it with a plain schema validator instead of anything fancy.

The principle is to start from the question "what does failure look like for my users?" and work backward to a metric, rather than reaching for a standard metric and hoping it is relevant.

Wherever a deterministic check is possible, use it. Exact match, regex, JSON schema validation, numeric tolerance, and unit-test-style assertions are fast, free, and never disagree with themselves. Reserve the expensive, fuzzy judgment for the cases that genuinely require it.
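As a rough illustration, the deterministic checks listed above fit in a few standard-library functions. The function names are made up for this sketch, and `valid_json_with_keys` is a simplified stand-in for real JSON Schema validation, which normally comes from a dedicated validator library.

```python
import json
import math
import re


def exact_match(output, expected):
    """Case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def matches_pattern(output, pattern):
    """True if the regex is found anywhere in the output."""
    return re.search(pattern, output) is not None


def within_tolerance(output, expected, rel_tol=0.01):
    """Parse the output as a number and compare within a relative tolerance."""
    try:
        return math.isclose(float(output), expected, rel_tol=rel_tol)
    except ValueError:
        return False  # the output was not a number at all


def valid_json_with_keys(output, required):
    """Simplified schema check: the output parses as a JSON object and
    every required key holds a value of the expected Python type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        key in data and isinstance(data[key], typ) for key, typ in required.items()
    )
```

Each check returns a plain boolean, so results can be stored next to judge scores and aggregated the same way.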

Step 3: LLM-as-a-Judge, Done Without Fooling Yourself

Many of the things you care about, like faithfulness, helpfulness, and tone, cannot be checked with a regex. The practical answer is to use a capable model to grade the output. This is LLM-as-a-judge, and it has become the default way to evaluate open-ended responses at scale.

It works. It also fails in specific, well-documented ways, and if you ignore those you will ship a judge that lies to you with a straight face.

Write a rubric, not a vibe

A vague instruction like "rate this answer from 1 to 10" produces noisy, uncalibrated scores. A judge needs the same thing a human grader needs: a concrete rubric that defines what each score means, with examples. Ask for a short justification before the score, give it the input, the response, and the reference or context, and constrain the output to a small, well-defined scale. Binary or three-point scales are far more consistent than a 1 to 10 range where nobody can explain the difference between a 6 and a 7.
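A sketch of what such a rubric might look like in code, assuming a three-point faithfulness scale and a JSON reply format. The rubric wording, the `build_judge_prompt` and `parse_verdict` names, and the reply contract are all illustrative choices, not a standard; the call to the judge model itself is left out.

```python
import json

RUBRIC = """\
Score the RESPONSE for faithfulness to the CONTEXT.
2 = every claim is supported by the context
1 = mostly supported, with one minor unsupported detail
0 = contains a claim the context contradicts or does not support"""

ALLOWED_SCORES = {0, 1, 2}


def build_judge_prompt(question, response, context, rubric=RUBRIC):
    """Assemble the grading prompt: rubric first, then the material,
    then the output contract (justification before score)."""
    return (
        f"{rubric}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"RESPONSE:\n{response}\n\n"
        'Reply with JSON only: {"justification": "<one or two sentences>", '
        '"score": <0, 1, or 2>}'
    )


def parse_verdict(raw):
    """Return the score, or None when the judge broke the contract.
    None should be counted as a judge failure, not as a low score."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    score = verdict.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, int) and not isinstance(score, bool) and score in ALLOWED_SCORES:
        return score
    return None
```

Rejecting out-of-range or malformed verdicts instead of coercing them keeps judge failures visible as their own metric.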

Know the biases

LLM judges are not neutral. Research has documented several systematic biases you have to design around:

Position bias. In pairwise comparisons, judges can favor a response because of where it appears (first or second) rather than because of its quality. The standard mitigation is to run each comparison twice with the order swapped and only trust a result that survives the swap.
Length bias. Judges tend to reward longer, more verbose answers even when a short answer is better.
Self-preference and family bias. A judge tends to rate outputs from its own model family more favorably.

Pairwise comparison ("is A better than B?") is a more direct way to rank two systems than comparing absolute scores, but it amplifies these biases more aggressively. Pointwise scoring is a coarser signal per example, but it flips its verdict far less often when you re-run it. Pick the protocol that matches your question, and never assume the judge is objective.
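The order-swap mitigation for position bias can be sketched in a few lines. Here `judge(question, first, second)` is a hypothetical callable that returns "first" or "second"; the two toy judges exist only to show the behavior and are not real models.

```python
def compare_with_swap(judge, question, response_a, response_b):
    """Run a pairwise comparison in both orders.

    Returns "A" or "B" when both orders agree, and "inconsistent"
    when the verdict flips with the presentation order.
    """
    forward = judge(question, response_a, response_b)
    backward = judge(question, response_b, response_a)

    winner_forward = "A" if forward == "first" else "B"
    # In the swapped run, "first" refers to response_b.
    winner_backward = "B" if backward == "first" else "A"

    if winner_forward == winner_backward:
        return winner_forward
    return "inconsistent"


# Toy judge with pure position bias: always picks whatever came first.
def always_first(question, first, second):
    return "first"


# Toy judge that looks at content only (here: prefers the shorter text).
def prefers_shorter(question, first, second):
    return "first" if len(first) <= len(second) else "second"
```

A high share of "inconsistent" results is itself a useful signal: it suggests the two systems are close, or that the judge prompt needs work.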

Calibrate against humans

A judge you have not validated is a random number generator with good grammar. Before you trust it, have humans label a sample of the same examples and measure agreement between the judge and the humans. If they disagree often, the rubric is usually the problem, not the humans. Fix the rubric and re-measure. A reasonable rule of thumb is to keep a human-labeled set of at least a hundred examples per rubric, and to recalibrate whenever you change the judge model or the prompt.
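The text does not prescribe a specific agreement measure. One common choice is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The sketch below is a plain standard-library implementation, and the labels are made-up data for illustration.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Agreement between two label sequences, corrected for chance.
    1.0 is perfect agreement; 0.0 is what chance alone would give."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(
        (judge_freq[label] / n) * (human_freq[label] / n)
        for label in set(judge_freq) | set(human_freq)
    )
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only: the raters agree on 6 of 8 examples.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(judge, human)
```

In this toy example the raw agreement is 75%, but kappa is noticeably lower because both raters say "pass" most of the time. That gap is the reason to prefer a chance-corrected measure when labels are imbalanced.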

Step 4: Make Evaluation Part of Development, Not a One-Time Event

The teams that ship reliable LLM products treat evals the way good engineering teams treat tests: as something that runs automatically, gates changes, and is never optional.

A practical setup looks like this:

On every commit: a lightweight, fast eval over a small core slice of the dataset, using deterministic checks where possible. This catches obvious regressions in seconds.
On merge: the full offline eval over the complete dataset, including the LLM-as-a-judge metrics. A change that meaningfully degrades a key metric should require explicit justification, exactly like a failing test would.
On a schedule in production: online evaluation over sampled live traffic, with alerts when a metric drifts.

This is sometimes called eval-driven development, and the mindset shift is the important part. A prompt change is no longer "I tried it a few times and it seemed better." It is "the eval score went from 0.78 to 0.84 across two hundred cases, with no regressions in the safety slice." One of those statements is defensible. The other is a guess.

Here is the shape of a minimal offline eval loop, stripped to the essentials:

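from statistics import mean


def aggregate(results):
    """Average each metric over the cases that reported it."""
    metrics = sorted({key for row in results for key in row if key != "id"})
    return {
        m: mean(float(row[m]) for row in results if m in row)
        for m in metrics
    }


def run_eval(system, dataset, judge, validate_schema):
    """Run `system` over a fixed dataset and return per-metric averages.

    `validate_schema(output, schema)` returns a bool and `judge.score(...)`
    returns a dict of numeric scores; both are supplied by the caller.
    """
    results = []
    for case in dataset:
        output = system(case.input)

        # Cheap deterministic checks first
        passed_format = validate_schema(output, case.schema)

        # Expensive judgment for what the cheap checks cannot cover
        scores = judge.score(
            question=case.input,
            answer=output,
            reference=case.reference,
            rubric=case.rubric,
        )

        results.append({
            "id": case.id,
            "format_ok": passed_format,
            **scores,
        })
    return aggregate(results)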

The structure matters more than the framework. Run the system over a fixed dataset, apply cheap checks before expensive ones, aggregate, and compare against the previous run. Whether you build this yourself or adopt one of the many eval frameworks available is a secondary decision.
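The "compare against the previous run" step is the part that turns a score into a gate. A minimal sketch, assuming scores are stored per slice and per metric with higher meaning better; the `find_regressions` name, the tolerance, and the safety numbers are hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Compare per-slice scores with the previous run.

    `baseline` and `current` map slice name -> metric name -> score.
    Returns (slice, metric, old, new) for every metric that dropped by
    more than `tolerance` or is missing from the current run.
    """
    regressions = []
    for slice_name, metrics in baseline.items():
        for metric, old in metrics.items():
            new = current.get(slice_name, {}).get(metric)
            if new is None or old - new > tolerance:
                regressions.append((slice_name, metric, old, new))
    return regressions


baseline = {
    "overall": {"correctness": 0.78},
    "safety": {"refusal_ok": 0.97},
}
current = {
    "overall": {"correctness": 0.84},  # headline number improved
    "safety": {"refusal_ok": 0.90},    # but the safety slice got worse
}
regressions = find_regressions(baseline, current)
```

In a CI job, a non-empty `regressions` list would typically fail the build or require an explicit override. The tolerance absorbs run-to-run noise from the judge.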

Step 5: Watch Production, Because Reality Drifts

Passing your offline evals does not mean you are done. It means you are ready to ship. Production behavior diverges from offline behavior for reasons that have nothing to do with your code:

Users send queries you never anticipated.
The distribution of requests shifts over time as your product and audience change.
A model upgrade on the provider's side quietly changes behavior under you.
Your document corpus grows and retrieval quality decays, the slow failure described in the RAG article.

Online evaluation handles this by sampling a fraction of live traffic and scoring it with the same judges and checks you use offline. You do not need to evaluate every request. A sampled stream is enough to surface drift, and it keeps the cost bounded.
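One simple way to pick the sampled fraction is to hash a request identifier, sketched below. The `should_evaluate` name and the `req-` IDs are illustrative, and the sketch assumes every request already carries a stable string ID.

```python
import hashlib


def should_evaluate(request_id, sample_rate):
    """Deterministically select roughly `sample_rate` of requests.

    Hashing the ID (instead of calling random()) means the same request
    always gets the same decision, across processes and retries.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


request_ids = [f"req-{n}" for n in range(10_000)]
sampled = [rid for rid in request_ids if should_evaluate(rid, 0.05)]
```

With a 5% rate, about one request in twenty ends up in `sampled`. Raising or lowering the rate only moves the threshold, so requests selected at a lower rate stay selected at a higher one.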

Pair the automated scoring with the cheapest signal of all: real user feedback. Thumbs up and down, regeneration clicks, and abandonment are all weak but honest indicators. The traces your users flag as bad are the richest possible source of new eval cases. Feed them back into the offline dataset, and a failure found in production becomes a regression test that guards the next release.
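The feedback loop can start very simply. The sketch below turns flagged traces into new, still-unlabeled cases and skips inputs the dataset already covers. The trace keys ("input", "output", "feedback"), the "thumbs_down" value, and the case layout are assumptions made for illustration.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants dedupe."""
    return " ".join(text.lower().split())


def failures_to_cases(traces, existing_inputs):
    """Turn user-flagged traces into unlabeled eval cases.

    New cases still need a person to write the expected outcome or
    rubric, so they are marked `needs_label`.
    """
    seen = {normalize(text) for text in existing_inputs}
    new_cases = []
    for trace in traces:
        if trace.get("feedback") != "thumbs_down":
            continue
        key = normalize(trace["input"])
        if key in seen:
            continue  # already covered by the dataset
        seen.add(key)
        new_cases.append({
            "id": "prod-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12],
            "input": trace["input"],
            "bad_output": trace["output"],  # kept as a known-bad example
            "needs_label": True,
        })
    return new_cases


traces = [
    {"input": "Cancel my plan", "output": "Done!", "feedback": "thumbs_down"},
    {"input": "cancel  my plan", "output": "Sure.", "feedback": "thumbs_down"},
    {"input": "What is your SLA?", "output": "99.9%", "feedback": "thumbs_up"},
    {"input": "Reset password", "output": "...", "feedback": "thumbs_down"},
]
new_cases = failures_to_cases(traces, existing_inputs=["reset password"])
```

Only the first trace becomes a case here: the second is a near-duplicate of it, the third was not flagged, and the fourth is already in the dataset.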

A Word on Cost

LLM-as-a-judge is itself an LLM call, which means evaluation has a bill. It is easy to let the judge cost balloon, especially in online evaluation over high-traffic systems.

A few practical guardrails keep it sane. Use deterministic checks for everything that does not strictly require judgment, so you only pay for the hard cases. Sample production traffic rather than scoring every request, because the marginal information from a higher sample rate falls off quickly. And when judge cost grows to a meaningful share of your production inference cost, switch to a smaller, distilled judge model for the routine cases and reserve the expensive judge for spot checks and calibration. All of this maps back to tokens, so it is worth understanding why tokens drive nearly every cost and limit in an LLM system.
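A back-of-the-envelope estimate makes the effect of sampling visible. Every number below is hypothetical, and the formula uses a single blended token price, which is a simplification since providers usually price input and output tokens differently.

```python
def daily_judge_cost(requests_per_day, sample_rate, tokens_per_judgment,
                     price_per_million_tokens):
    """Rough daily spend on judge calls, in the currency of the price."""
    judged = requests_per_day * sample_rate
    return judged * tokens_per_judgment * price_per_million_tokens / 1_000_000


# Hypothetical workload: 200,000 requests per day, 3,000 tokens per
# judgment (prompt, context, response, verdict), 5.00 per million tokens.
full = daily_judge_cost(200_000, 1.00, 3_000, 5.0)     # score every request
sampled = daily_judge_cost(200_000, 0.02, 3_000, 5.0)  # score a 2% sample
```

Under these assumptions, scoring everything costs about 3,000 per day and a 2% sample about 60, while the sample still yields roughly 4,000 judged requests per day to detect drift with.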

Common Pitfalls

A short list of the mistakes that quietly ruin evaluation efforts:

Eval set contamination. If your evaluation examples have leaked into the model's training data, or into your own prompt examples, your scores are inflated and meaningless.
Testing only the happy path. A set with no hard cases produces a high score and zero insight.
Trusting an uncalibrated judge. A judge you have not checked against humans is not a measurement.
One score to rule them all. Collapsing everything into a single average hides the regression that only affects the safety slice. Track metrics per category and per slice.
Evaluating once. A single pre-launch eval is a snapshot. Quality is a moving target, and only a recurring eval keeps up with it.
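To see how a single average hides a slice-level problem, here is a small sketch with made-up results. The "slice" and "score" keys and the `scores_by_slice` name are illustrative.

```python
from collections import defaultdict
from statistics import mean


def scores_by_slice(results):
    """Group per-case scores by slice and average each group,
    alongside the overall average for comparison."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["slice"]].append(row["score"])
    report = {name: mean(scores) for name, scores in grouped.items()}
    report["overall"] = mean(row["score"] for row in results)
    return report


# 90 routine cases that all pass, 10 safety cases where only 4 pass.
results = [{"slice": "routine", "score": 1.0}] * 90
results += [{"slice": "safety", "score": 1.0}] * 4
results += [{"slice": "safety", "score": 0.0}] * 6
report = scores_by_slice(results)
```

The overall score comes out at 0.94, which looks healthy, while the safety slice sits at 0.4. Reporting both is what keeps the second number from disappearing.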

A Production Evaluation Checklist

Before you call an LLM application production-ready:

Dataset:

[ ] Eval set built from representative real traffic
[ ] Hard cases, edge cases, and out-of-scope inputs included
[ ] Each case labeled with an expected outcome or rubric
[ ] A feedback loop that turns production failures into new cases

Metrics:

[ ] Metrics tied to actual task failure modes, not generic defaults
[ ] Deterministic checks used wherever possible
[ ] Faithfulness or grounding measured for any retrieval-based feature
[ ] Results tracked per category and per slice, not as one average

Judge:

[ ] Clear rubric with defined score levels and examples
[ ] Position bias mitigated by swapping order in pairwise comparisons
[ ] Judge calibrated against at least 100 human-labeled examples
[ ] Recalibration scheduled when the judge model or prompt changes

Process:

[ ] Lightweight eval on every commit
[ ] Full offline eval gating every merge
[ ] Online evaluation sampling live production traffic
[ ] Drift and cost alerts wired up

Frequently Asked Questions

What is LLM evaluation?

LLM evaluation is the practice of measuring whether a large language model application produces correct, grounded, relevant, and safe outputs, both before release through offline tests on a fixed dataset and after release through online monitoring of live traffic. It replaces informal "the answers look good" judgment with repeatable measurement.

What is the difference between offline and online LLM evaluation?

Offline evaluation runs against a fixed, curated dataset and is used to gate changes and catch regressions before shipping. Online evaluation runs against sampled live production traffic and is used to detect drift, abuse, and cost issues in real time. Production-grade systems use both.

How many examples do I need in an eval dataset?

You can start with fifty to a few hundred well-chosen examples that cover both common cases and known failure modes. Coverage of the cases that actually break matters far more than raw size. Grow the set over time by adding every production failure back into it.

Is LLM-as-a-judge reliable?

It is reliable enough to be useful at scale, but only if you control for its known biases (position, length, and self-preference), give it a concrete rubric instead of a vague scale, and calibrate it against human labels before trusting its scores. An unvalidated judge is not a measurement.

The Right Mental Model

The teams that struggle with LLM products believe evaluation is something you do at the end, if there is time. The teams that ship reliable ones build the eval harness first and treat its scores as the source of truth for every decision.

A demo proves that your system can produce a good answer. An evaluation proves that it reliably does, across the messy, unpredictable inputs of real users, and keeps doing it after every change you make.

The model is rarely the thing that fails in production. What fails is the absence of a measurement that would have told you the truth earlier. Build that measurement first, and most of the production surprises disappear.

Building production AI systems? I write regularly about applied AI engineering, system architecture, and the real lessons from production deployments. Find me on LinkedIn or reach out directly at ciao@pavlo.sh.