A measurement is useful when it would change a decision. Most hallucination metrics in the wild fail this test: they wobble too much to gate a deploy, and they correlate weakly with what users complain about. This post walks through the rubric we use on every 47hq engagement.
The shape of a useful metric
We want a number that (a) moves when the system gets meaningfully worse, (b) doesn't move when prompts change cosmetically, and (c) maps to user-facing failure modes. That last criterion kills 80% of off-the-shelf benchmarks.
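Criteria (a) and (b) can be checked mechanically before a metric is trusted as a gate. The sketch below is one hypothetical way to do that: `is_decision_grade`, its arguments, and the `noise_tolerance` default of one percentage point are illustrative assumptions, not part of our rubric.

```python
def is_decision_grade(baseline, cosmetic, degraded, noise_tolerance=0.01):
    """Check a candidate metric against criteria (a) and (b).

    baseline  -- metric value on the current system
    cosmetic  -- value after a cosmetic prompt rewording (same behaviour expected)
    degraded  -- value on a deliberately worsened system
    All three are failure rates in [0, 1]; higher means worse.
    noise_tolerance is an ASSUMED threshold; tune it for your own suite.
    """
    # (b) a cosmetic change should barely move the number
    cosmetic_shift = abs(cosmetic - baseline)
    # (a) a real regression should move it by clearly more than the noise
    regression_shift = degraded - baseline
    return cosmetic_shift <= noise_tolerance and regression_shift > noise_tolerance


# Made-up illustrative values: small cosmetic drift, clear regression signal
print(is_decision_grade(baseline=0.020, cosmetic=0.022, degraded=0.060))
```

Criterion (c) is not something code can confirm; it still needs a person comparing flagged failures against real user complaints.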
# rubric.py
RUBRIC = """
Score the assistant's answer on a 0–3 scale:
3 — fully grounded in retrieved context, no unsupported claims
2 — mostly grounded, one inconsequential gap
1 — at least one load-bearing claim is unsupported
0 — answer contradicts the context
Return JSON: {"score": int, "unsupported_claims": [str]}
"""
The "unsupported_claims" array is the load-bearing piece. It gives you a corpus of failure examples you can re-sample as your golden set.
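To show how that array might be harvested, here is a minimal sketch that validates one judge response against the JSON shape in the rubric and collects the unsupported claims. The function names `parse_judgement` and `collect_failures`, the `(query, raw_response)` input format, and the sample response are assumptions made for illustration.

```python
import json

VALID_SCORES = {0, 1, 2, 3}  # the 0-3 scale from the rubric


def parse_judgement(raw):
    """Parse one judge response and validate it against the rubric's JSON shape."""
    data = json.loads(raw)
    score = data["score"]
    claims = data["unsupported_claims"]
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or score not in VALID_SCORES:
        raise ValueError(f"score must be an integer from 0 to 3, got {score!r}")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("unsupported_claims must be a list of strings")
    return {"score": score, "unsupported_claims": claims}


def collect_failures(judgements):
    """Return (query, claim) pairs for every unsupported claim.

    judgements is an iterable of (query, raw_response) pairs (assumed format).
    """
    return [
        (query, claim)
        for query, raw in judgements
        for claim in parse_judgement(raw)["unsupported_claims"]
    ]


# Hypothetical judge output, for illustration only
raw = '{"score": 1, "unsupported_claims": ["refund window is 90 days"]}'
print(collect_failures([("What is the refund policy?", raw)]))
```

Rejecting malformed responses up front keeps a flaky judge from silently skewing the scores.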
What we run in CI
Every PR runs the suite at concurrency 8 against a 500-query golden set. If hallucination_rate exceeds 3%, the merge is blocked. The gate is cheap to run and has caught two regressions in the last quarter that the team's eyeball reviews missed.
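A gate like this could be sketched as follows. This is not our actual CI script. In particular, it assumes that a query counts as a hallucination when its rubric score is 1 or lower, which is one plausible reading. `judge` is a hypothetical callable that scores one query, and `score_all`, `hallucination_rate`, and `main` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 0.03  # the 3% gate


def score_all(queries, judge, concurrency=8):
    """Score every query with judge (hypothetical callable), 8 at a time by default."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order
        return list(pool.map(judge, queries))


def hallucination_rate(scores, failing_at_or_below=1):
    """Fraction of queries whose rubric score is at or below the cutoff.

    ASSUMPTION: scores of 0 or 1 count as hallucinations.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    failures = sum(1 for s in scores if s <= failing_at_or_below)
    return failures / len(scores)


def main(scores, threshold=THRESHOLD):
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    rate = hallucination_rate(scores)
    print(f"hallucination_rate={rate:.2%} (threshold {threshold:.0%})")
    # Block only when the rate strictly exceeds the threshold
    return 1 if rate > threshold else 0
```

In a CI job, `sys.exit(main(scores))` would turn the result into the exit status that blocks or allows the merge.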