Measuring hallucination without a vendor

By Usman Cheema · April 12, 2026 · 9 min · #evals #RAG

A measurement is useful when it would change a decision. Most hallucination metrics in the wild fail this test: they wobble too much to gate a deploy, and they correlate weakly with what users complain about. This post walks through the rubric we use on every 47hq engagement.

The shape of a useful metric

We want a number that (a) moves when the system gets meaningfully worse, (b) doesn't move when prompts change cosmetically, and (c) maps to user-facing failure modes. That last criterion kills 80% of off-the-shelf benchmarks.

# rubric.py
RUBRIC = """
Score the assistant's answer on a 0–3 scale:

  3 — fully grounded in retrieved context, no unsupported claims
  2 — mostly grounded, one inconsequential gap
  1 — at least one load-bearing claim is unsupported
  0 — answer contradicts the context

Return JSON: {"score": int, "unsupported_claims": [str]}
"""

The "unsupported_claims" array is the load-bearing piece. It gives you a corpus of failure examples you can re-sample as your golden set.
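One way to turn judge responses into that corpus is sketched below. It is a minimal example, not our production harness: the function names parse_judgment and collect_failures, and the (query, raw_json) input shape, are assumptions made for illustration. Only the "score" and "unsupported_claims" fields come from the rubric.

```python
import json


def parse_judgment(raw: str) -> dict:
    """Validate one judge response against the rubric's JSON shape."""
    data = json.loads(raw)
    score = data["score"]
    claims = data["unsupported_claims"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
        raise ValueError(f"score must be an int in 0..3, got {score!r}")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("unsupported_claims must be a list of strings")
    return {"score": score, "unsupported_claims": claims}


def collect_failures(judgments):
    """Flatten (query, raw_json) pairs into one record per unsupported claim.

    The input shape is hypothetical; adapt it to however your harness
    stores judge output.
    """
    failures = []
    for query, raw in judgments:
        parsed = parse_judgment(raw)
        for claim in parsed["unsupported_claims"]:
            failures.append(
                {"query": query, "claim": claim, "score": parsed["score"]}
            )
    return failures
```

Validating strictly matters here: a judge that returns malformed JSON or an out-of-range score should fail loudly rather than quietly shrink the failure corpus.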

What we run in CI

Every PR runs the suite at concurrency 8 against a 500-query golden set. If hallucination_rate exceeds 3%, the merge is blocked. The gate is cheap to run, and in the last quarter it caught two regressions that the team's eyeball reviews missed.
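A gate like this can be sketched in a few lines. Note the assumption: this post does not spell out how hallucination_rate is computed, so the sketch treats it as the fraction of golden-set queries scoring 1 or lower on the rubric. The failing_max parameter and the gate function are hypothetical names, not part of any library.

```python
THRESHOLD = 0.03  # the 3% merge gate described above


def hallucination_rate(scores, failing_max=1):
    """Fraction of rubric scores at or below failing_max.

    Assumed definition: a score of 0 or 1 counts as a hallucination.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    failing = sum(1 for s in scores if s <= failing_max)
    return failing / len(scores)


def gate(scores, threshold=THRESHOLD):
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    rate = hallucination_rate(scores)
    print(f"hallucination_rate={rate:.2%} (threshold {threshold:.0%})")
    return 1 if rate > threshold else 0
```

In a CI job, the script would end with something like sys.exit(gate(scores)), so a non-zero exit code fails the check. With 500 queries, the 3% threshold tolerates at most 15 failing answers.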
