Measuring hallucination without a vendor

By Usman Cheema | April 12, 2026 | 9 min | #evals #RAG

A measurement is useful when it would change a decision. Most hallucination metrics in the wild fail this test: they wobble too much to gate a deploy, and they correlate weakly with what users complain about. This post walks through the rubric we use on every 47hq engagement.

The shape of a useful metric

We want a number that (a) moves when the system gets meaningfully worse, (b) doesn't move when prompts change cosmetically, and (c) maps to user-facing failure modes. That last criterion kills 80% of off-the-shelf benchmarks.

# rubric.py
RUBRIC = """
Score the assistant's answer on a 0–3 scale:

  3 — fully grounded in retrieved context, no unsupported claims
  2 — mostly grounded, one inconsequential gap
  1 — at least one load-bearing claim is unsupported
  0 — answer contradicts the context

Return JSON: {"score": int, "unsupported_claims": [str]}
"""

The "unsupported_claims" array is the load-bearing piece. It gives you a corpus of failure examples you can re-sample as your golden set.
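To make that concrete, here is a minimal sketch of how a judge reply in the rubric's JSON shape could be validated and its unsupported claims collected into failure records. The names `Judgement`, `parse_judgement`, and `collect_failures` are hypothetical, chosen for illustration; how the judge model is called and where the records are stored are left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Judgement:
    score: int
    unsupported_claims: list[str] = field(default_factory=list)


def parse_judgement(raw: str) -> Judgement:
    """Parse and validate one judge reply against the rubric's JSON shape."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
        raise ValueError(f"score must be an integer in 0..3, got {score!r}")
    claims = data.get("unsupported_claims", [])
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("unsupported_claims must be a list of strings")
    return Judgement(score=score, unsupported_claims=claims)


def collect_failures(judged: list[tuple[str, Judgement]]) -> list[dict]:
    """Flatten (query, judgement) pairs into one record per unsupported claim."""
    return [
        {"query": query, "score": j.score, "claim": claim}
        for query, j in judged
        for claim in j.unsupported_claims
    ]
```

Rejecting out-of-range or mistyped scores at parse time keeps a misbehaving judge from silently skewing the aggregate; whether to retry or drop such replies is a separate choice this sketch does not make.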

What we run in CI

Every PR runs the suite at concurrency 8 against a 500-query golden set. If hallucination_rate exceeds 3%, the merge is blocked. The gate is cheap to run and has caught two regressions in the last quarter that the team's eyeball reviews missed.
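A gate like this can be sketched in a few lines. The post does not spell out how hallucination_rate is computed, so the version below assumes it is the share of golden-set answers scoring 1 or lower on the rubric; that cutoff, and the function names, are assumptions for illustration. The standard-error helper is an addition that uses the ordinary binomial formula as a rough gauge of sampling noise.

```python
import math

THRESHOLD = 0.03       # from the post: block the merge above 3%
FAILING_SCORE_MAX = 1  # assumption: rubric scores 0 and 1 count as hallucinations


def hallucination_rate(scores: list[int]) -> float:
    """Fraction of answers whose rubric score is at or below the failing cutoff."""
    if not scores:
        raise ValueError("no scores to aggregate")
    failures = sum(1 for s in scores if s <= FAILING_SCORE_MAX)
    return failures / len(scores)


def standard_error(rate: float, n: int) -> float:
    """Binomial standard error of a proportion measured on n samples."""
    return math.sqrt(rate * (1 - rate) / n)


def gate(scores: list[int], threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 1 blocks the merge, 0 lets it through."""
    rate = hallucination_rate(scores)
    se = standard_error(rate, len(scores))
    print(f"hallucination_rate={rate:.2%} (±{se:.2%} s.e.) over {len(scores)} queries")
    return 1 if rate > threshold else 0
```

A CI wrapper would pass the return value of `gate` to `sys.exit`. Under these assumptions, 3% of a 500-query set is 15 failing answers, and the binomial standard error near that rate is roughly 0.8 percentage points, which is one way to judge whether a result just over the line is a regression or noise.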
