Series B · Support automation SaaS · 140 employees
Helio Support
Reduced hallucination rate from 14% to 1.8% on a 500-query golden dataset in 5 weeks.
Helio's customer-facing answer agent was hallucinating product details on 14% of queries (scored by an LLM-as-judge on weekly samples). Their CSAT had dropped from 4.6 to 4.1 over two quarters, and support leadership was about to disable the AI surface entirely.
Their existing evaluation was a set of 60 hand-labeled prompts that a release manager reviewed by eye. There was no CI gate, no golden dataset, and no per-component score, so they could not tell whether a regression came from retrieval, the prompt, or a model upgrade.
- Built a 500-query golden dataset with their data team, sourced from anonymized production traffic and stratified by intent and difficulty.
- Replaced their flat chunking with section-aware chunking and a 300-token overlap. We benchmarked three chunking strategies against the golden set; the winner improved retrieval precision@5 from 0.71 to 0.91.
- Added a Cohere reranker stage and a citation-grounding check that drops claims not supported by retrieved context.
- Wired our internal eval suite into their CI. Every PR runs it, and a hallucination_rate above 3% blocks the merge.
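
The merge gate in the last step can be illustrated with a minimal sketch. This is not the eval suite itself: `JudgedAnswer`, `hallucination_rate`, and `gate` are hypothetical names chosen for illustration, and the sketch assumes each golden-set query has already received a boolean verdict from the judge. Only the 3% threshold comes from the engagement.

```python
from dataclasses import dataclass

# Threshold from the CI rule: a hallucination rate above 3% blocks the merge.
THRESHOLD = 0.03


@dataclass
class JudgedAnswer:
    """One golden-set query with the judge's verdict (hypothetical record shape)."""
    query_id: str
    hallucinated: bool


def hallucination_rate(results):
    """Fraction of judged answers flagged as hallucinated."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.hallucinated for r in results) / len(results)


def gate(results, threshold=THRESHOLD):
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    return 1 if hallucination_rate(results) > threshold else 0
```

A CI step could pass the return value of `gate` to `sys.exit` so that a failing score fails the PR check. On a 500-query set, 9 flagged answers give a rate of 1.8% and pass, while 16 flagged answers give 3.2% and block.
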
| Metric | Before | After |
|---|---|---|
| Hallucination rate | 14% | 1.8% |
| Retrieval precision@5 | 0.71 | 0.91 |
| Citation integrity | n/a | 0.97 |
| P95 latency | 1.4s | 0.64s |
All metrics were measured with our eval suite against the golden set defined during the engagement. The methodology is documented in the handoff doc.
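
For readers unfamiliar with the retrieval metric in the table, precision@k is the fraction of the top k retrieved chunks that are relevant to the query, averaged over the dataset. The sketch below shows the standard definition only. It assumes each golden-set entry carries a set of relevant chunk IDs, and the function names and data layout are illustrative rather than the eval suite's actual API.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunk IDs that appear in the relevant set.

    Divides by k even when fewer than k chunks were retrieved, so a short
    result list is penalized (a common convention, assumed here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def mean_precision_at_k(runs, k=5):
    """Average precision@k over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to score")
    return sum(precision_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

For example, if three of the first five retrieved chunks for a query are in its relevant set, that query scores 0.6. The dataset-level figure is the mean of the per-query scores.
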
"We had three other agencies pitch us. 47hq was the only one that showed us their eval methodology before they showed us a deck. After the engagement we re-ran the suite ourselves — the numbers held."