Honey Badger IT Limited
AI / ML Engineering

Why your RAG system needs an evaluation harness — not vibes

Most RAG systems die in production not because the model is bad, but because nobody can tell when it gets worse. Here's how to fix that.

April 8, 2026 · 12 min read

The problem with "looks good"

Most RAG demos go like this: the developer types in a question, the system returns an answer, the developer says "that looks good," and the project gets a thumbs-up.

Then it ships, the underlying model gets a quiet update, the chunking strategy drifts, somebody changes the embedding model, and three weeks later the support inbox fills up with "the bot is dumber now."

Nobody can prove it's worse. Nobody can prove it's the same. Nobody knows.

The fix is an evaluation harness. It's not glamorous, but it's the difference between an AI feature that keeps shipping and one that quietly rots until someone turns it off.

What an evaluation harness actually contains

A real eval harness has four layers:

1. A held-out test set

A hand-curated set of 100–500 representative questions, each with:

  • The question text
  • Acceptable answer(s) — sometimes one, sometimes a range
  • Required citations (which source documents must appear in retrieval)
  • Difficulty tier (easy / medium / hard)

This is the slow part. There's no shortcut. Senior domain experts have to write these questions, and they have to be honest about what "correct" means.
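To make the shape concrete, here's a minimal sketch of how one of these test cases might be represented. The field names and the sample question are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str                  # the question text
    acceptable_answers: list[str]  # one reference answer, or several acceptable variants
    required_sources: list[str]    # document IDs that must appear in retrieval
    tier: str                      # "easy", "medium", or "hard"

# Illustrative example only — your domain experts write the real ones.
cases = [
    EvalCase(
        question="What is the notice period for cancelling a standard plan?",
        acceptable_answers=["30 days", "thirty days' written notice"],
        required_sources=["terms-of-service.md"],
        tier="easy",
    ),
]
```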

2. Automated metrics

For each test question, we compute:

  • Retrieval recall: did the right source documents make it into the top-K context?
  • Citation accuracy: does the answer cite a real source from the retrieved context?
  • Answer faithfulness: does the answer's claim match the cited source?
  • Answer correctness: an LLM-as-judge score against the reference answer.

The first three are deterministic. The fourth is squishy but useful in aggregate.
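The two purely retrieval-side metrics boil down to set arithmetic. A rough sketch, assuming the retrieved and cited document IDs are already extracted (faithfulness and the LLM-as-judge score need a model call, so they're omitted here):

```python
def retrieval_recall(required_ids: set[str], retrieved_ids: list[str]) -> float:
    """Fraction of required source documents that made it into the top-K context."""
    if not required_ids:
        return 1.0
    return len(required_ids & set(retrieved_ids)) / len(required_ids)

def citation_accuracy(cited_ids: list[str], retrieved_ids: list[str]) -> float:
    """Fraction of the answer's citations that point at a document actually retrieved."""
    if not cited_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    return sum(1 for cited in cited_ids if cited in retrieved) / len(cited_ids)
```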

3. A regression dashboard

Every time we change anything — the chunking strategy, the embedding model, the prompt, the retrieval ranker — we run the full eval harness and compare to the last green run.

Anything that drops more than 2 percentage points on any metric is a regression. Regressions block merge.
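The gate itself is tiny. A sketch, with illustrative metric names and scores standing in for whatever your stored run results look like:

```python
THRESHOLD = 2.0  # percentage points, per the rule above

def find_regressions(last_green: dict, current: dict) -> dict:
    """Metrics that dropped by more than THRESHOLD versus the last green run."""
    return {
        metric: (last_green[metric], current[metric])
        for metric in last_green
        if last_green[metric] - current.get(metric, 0.0) > THRESHOLD
    }

# Illustrative scores (0–100); in practice these come from stored run results.
last_green = {"retrieval_recall": 91.0, "citation_accuracy": 88.5}
current    = {"retrieval_recall": 86.0, "citation_accuracy": 88.0}

regressions = find_regressions(last_green, current)
if regressions:
    for metric, (before, after) in regressions.items():
        print(f"REGRESSION: {metric} {before:.1f} -> {after:.1f}")
    raise SystemExit(1)  # a non-zero exit is what blocks the merge in CI
```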

4. A human review queue

The bottom 10% of automated scores get sent to a human reviewer weekly. They confirm the failure mode and add the example to the test set if it represents a new class of bug.
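Selecting the queue is a few lines of logic. A sketch, assuming each case ends up as a (case ID, automated score) pair:

```python
def review_queue(results: list[tuple[str, float]], fraction: float = 0.10) -> list[str]:
    """Case IDs with the lowest automated scores, sized to the bottom `fraction` of the run."""
    ranked = sorted(results, key=lambda pair: pair[1])  # worst automated scores first
    n = max(1, int(len(ranked) * fraction))             # always send at least one case
    return [case_id for case_id, _ in ranked[:n]]
```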

What this catches

In the past 12 months, our eval harness has caught:

  • A retrieval bug where a chunking change cut recall by 14 points (caught in CI)
  • An OpenAI model update that subtly changed answer style and broke citation parsing (caught in scheduled re-eval)
  • A prompt change that improved easy-tier scores but tanked hard-tier scores (caught in the difficulty breakdown)

Without the harness, all three would have shipped to production. With the harness, none did.

The shortcut that doesn't work

The shortcut is "use an LLM to grade an LLM." It works for ~70% of cases. The other 30% is where the real bugs hide. You still need humans on the hard tier, and you still need deterministic metrics on retrieval.

If you only do LLM-as-judge, you're back to vibes. The judges drift, the scores drift, you have a number that isn't a measurement.

What to ship first

If you're starting from zero on a RAG system:

  1. Hand-write 30 questions with reference answers. Just 30.
  2. Build the retrieval recall metric. Nothing else yet.
  3. Run it before any change merges.

That's enough to catch 60% of the regressions you'd otherwise ship. Add the rest as the system grows.
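A sketch of steps 2 and 3, with `retrieve` standing in for whatever retriever you already have and `cases` for the 30 hand-written questions; both names are assumptions, not a real API:

```python
from typing import Callable

def run_starter_harness(
    cases: list[dict],
    retrieve: Callable[[str, int], list[str]],  # your retriever: (question, k) -> document IDs
    k: int = 5,
) -> float:
    """Average retrieval recall over the hand-written cases."""
    scores = []
    for case in cases:
        retrieved = set(retrieve(case["question"], k))
        required = set(case["required_sources"])
        scores.append(len(required & retrieved) / len(required) if required else 1.0)
    return sum(scores) / len(scores)
```

Record that one number on every run and fail the build when it drops below the last green run, exactly like the regression gate above.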

The point isn't to have a perfect harness. The point is to have a harness. Vibes don't survive contact with production.

Working on something similar?

We'd love to hear about it.