The 11% — what mature evals actually look like
We informally surveyed 84 mid-market SaaS teams shipping AI features in Q1 2026. Roughly 11% have evaluation discipline we'd call mature: golden datasets versioned in source control, CI gates, regression alerts, and a single owner.
The four things in common
1. The golden set lives in the repo. Not a Google Sheet, not a Notion doc. Pull requests against it are reviewed like code.
2. The eval runs on every PR. Not nightly, not on merge. The gate is part of the change.
3. One owner. Someone is on the hook for the methodology, including knowing when to retire metrics.
4. The judge is calibrated against humans. Quarterly a human re-labels a stratified sample and you measure judge agreement. If agreement drifts, the rubric is wrong.