Work · 3 of 12+ published

Production AI engagements, in detail.

Each case study below is detailed enough to serve as a sales-engineering deliverable on its own. Named customers, named metrics, stated methodology. No anonymous studies.

Series B · Support automation SaaS · 140 employees

Helio Support

Reduced hallucination rate from 14% to 1.8% on a 500-query golden dataset in 5 weeks.

The problem

Helio's customer-facing answer agent was hallucinating product details on 14% of queries (LLM-as-judge, sampled weekly). Their CSAT had dropped from 4.6 to 4.1 over two quarters and support leadership was about to disable the AI surface entirely.

Their existing evaluation was a set of 60 hand-labeled prompts that a release manager reviewed by eye. There was no CI gate, no golden dataset, and no per-component score, so they could not tell whether retrieval, the prompt, or the model upgrade was the source of the regression.

What we did

Built a 500-query golden dataset stratified by intent and difficulty, sourced from anonymized production traffic with their data team.
Replaced their flat chunking with section-aware chunking + 300-token overlap. Benchmarked three strategies against the golden set; the winner improved retrieval precision@5 from 0.71 to 0.91.
Added a Cohere reranker stage and a citation-grounding check that drops claims not supported by retrieved context.
Wired our internal eval suite into their CI. Every PR runs it; hallucination_rate > 3% blocks merge.
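
The merge gate in the last step can be sketched roughly as follows. This is a minimal illustration, not the eval suite used in the engagement: the `JudgedAnswer` record, the `gate` function, and the demo data are hypothetical, and only the 3% threshold comes from the text above.

```python
from dataclasses import dataclass

# Threshold taken from the text: a hallucination rate above 3% blocks the merge.
MAX_HALLUCINATION_RATE = 0.03


@dataclass
class JudgedAnswer:
    """One golden-set query plus the judge's verdict (hypothetical record shape)."""
    query_id: str
    hallucinated: bool


def hallucination_rate(results: list[JudgedAnswer]) -> float:
    """Fraction of judged answers flagged as hallucinated."""
    if not results:
        raise ValueError("no eval results; refusing to pass an empty run")
    return sum(r.hallucinated for r in results) / len(results)


def gate(results: list[JudgedAnswer], threshold: float = MAX_HALLUCINATION_RATE) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 blocks it."""
    rate = hallucination_rate(results)
    print(f"hallucination_rate={rate:.3f} threshold={threshold:.3f}")
    return 1 if rate > threshold else 0


# Illustrative data only: 9 flagged answers out of 500 queries.
demo = [JudgedAnswer(f"q{i}", hallucinated=i < 9) for i in range(500)]
exit_code = gate(demo)
```

In a CI job, the step would typically end with `sys.exit(gate(results))` so that a non-zero code fails the pipeline. Raising on an empty result set is a deliberate choice here: a run that judged nothing should not count as a pass.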

Outcome

Metric	Before	After
Hallucination rate	14%	1.8%
Retrieval precision @5	0.71	0.91
Citation integrity	n/a	0.97
P95 latency	1.4s	0.64s

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.
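
For readers unfamiliar with the retrieval metric, precision@5 can be computed along these lines. This is a sketch under assumptions: the chunk IDs and relevance labels are invented, and dividing by k even when fewer than k chunks come back is one common convention, not necessarily the one the eval suite uses.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k


def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average precision@k over every (retrieved, relevant) pair in a golden set."""
    if not runs:
        raise ValueError("golden set is empty")
    return sum(precision_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Two made-up queries: the first has 4 relevant chunks in its top 5, the second has 5.
golden_runs = [
    (["a", "b", "c", "d", "e"], {"a", "b", "c", "d"}),
    (["f", "g", "h", "i", "j"], {"f", "g", "h", "i", "j"}),
]
score = mean_precision_at_k(golden_runs, k=5)
```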

"We had three other agencies pitch us. 47hq was the only one that showed us their eval methodology before they showed us a deck. After the engagement we re-ran the suite ourselves — the numbers held."
— Director of Engineering, Helio Support

Timeline · 5 weeks
Team · 1 senior + 1 staff
Scope · Fixed scope

Series A · DevOps copilot · 60 employees

Lattice Field

Shipped a CLI copilot with sub-1s P95 and an eval suite that gates every release.

The problem

Lattice was about to ship a CLI copilot that suggests shell commands inside their DevOps platform. Internal dogfooding revealed unpredictable latency, and ~7% of suggestions were syntactically invalid for the user's shell. They needed reliability discipline before exposing it to paying customers.

No formal evals existed. The team was testing by hand against ~30 prompts and arguing about whether a model swap had caused a regression.

What we did

Designed a multi-stage retrieval architecture: per-tenant command history embedded and stored in pgvector, plus a small in-context fallback ruleset for shell-syntax validation.
Wrote an LLM-as-judge rubric specific to shell-command correctness and built a 1,200-query golden set covering bash, zsh, and PowerShell.
Introduced prompt + retrieval versioning with a one-click rollback path. Shipped P95 dashboards in Grafana.
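
The versioning step can be pictured as a registry in which a prompt and its retrieval settings are published as one unit, so that a rollback restores both together. The following is a minimal in-memory sketch: `Release`, `ReleaseRegistry`, the template strings, and the `top_k` values are all hypothetical, and a real setup would persist the history somewhere durable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Release:
    """A prompt template and retrieval config that ship (and roll back) together."""
    version: str
    prompt_template: str
    retrieval_config: dict


@dataclass
class ReleaseRegistry:
    """In-memory release history; the last entry is the active release."""
    history: list[Release] = field(default_factory=list)

    def publish(self, release: Release) -> None:
        """Make a new release active by appending it to the history."""
        self.history.append(release)

    @property
    def active(self) -> Release:
        if not self.history:
            raise LookupError("no release published yet")
        return self.history[-1]

    def rollback(self) -> Release:
        """Drop the active release and return the one that preceded it."""
        if len(self.history) < 2:
            raise LookupError("nothing to roll back to")
        self.history.pop()
        return self.active


# Hypothetical releases, for illustration only.
registry = ReleaseRegistry()
registry.publish(Release("v1", "Suggest a {shell} command for: {task}", {"top_k": 5}))
registry.publish(Release("v2", "Suggest one safe {shell} command for: {task}", {"top_k": 8}))
restored = registry.rollback()
```

Bundling the prompt and retrieval settings into one versioned object is the design point: rolling back only one of the two can leave the system in a combination that was never evaluated.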

Outcome

Metric	Before	After
Invalid commands	7.2%	0.6%
P95 end-to-end	1.8s	0.92s
Eval coverage (prompts)	~30	1,200

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.
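
As a point of reference for the latency figures, a P95 over raw samples can be computed with the nearest-rank method, sketched below. This is an assumption made for illustration: dashboards often estimate percentiles from histogram buckets instead, and the source does not say which method was used. The sample values are invented.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit floating-point drift.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Twenty invented latencies in seconds: 0.1, 0.2, ..., 2.0.
latencies = [i / 10 for i in range(1, 21)]
p95 = percentile_nearest_rank(latencies, 95)
```

With 20 samples the 95th percentile is the 19th-smallest value, so a single slow outlier does not move it, but two would.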

"47hq treats reliability as engineering, not a deck. The runbooks they left us are still what on-call uses."
— CTO, Lattice Field

Timeline · 9 weeks
Team · 1 staff + 1 senior
Scope · Fixed scope

Series B · Clinical knowledge platform · 220 employees

Arden Health

Rebuilt clinical-RAG pipeline with citation-first answers and an audited eval methodology.

The problem

Arden's clinical knowledge surface was answering provider questions, but every answer had to be auditable — which prior art it pulled from, with what confidence, and whether the citation was load-bearing or decorative. Their existing system returned answers without verifiable grounding.

Compliance was blocking expansion to two health systems pending a documented eval methodology.

What we did

Designed a citation-first answer schema: model must return claim → supporting span → document ID for every assertion, or refuse.
Built a 2,000-query golden set with clinician review of ground-truth spans.
Integrated our citation_integrity scoring into a weekly compliance report.
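
The citation-first schema and its refuse-or-ground rule can be sketched as below. The class names, the verbatim-substring check, and the definition of citation integrity as the share of claims whose span appears in the cited document are all assumptions made for illustration; the source does not spell out how `citation_integrity` is scored. The corpus text is placeholder data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str             # the assertion made in the answer
    supporting_span: str  # verbatim quote offered as evidence
    document_id: str      # ID of the document the span should come from


@dataclass(frozen=True)
class Answer:
    claims: tuple[Claim, ...]
    refused: bool = False


def is_grounded(claim: Claim, corpus: dict[str, str]) -> bool:
    """True when the cited document exists and contains the span verbatim."""
    document = corpus.get(claim.document_id)
    return (
        bool(claim.supporting_span)
        and document is not None
        and claim.supporting_span in document
    )


def citation_integrity(answer: Answer, corpus: dict[str, str]) -> float:
    """Share of claims whose citation checks out (assumed definition)."""
    if answer.refused or not answer.claims:
        raise ValueError("no claims to score")
    return sum(is_grounded(c, corpus) for c in answer.claims) / len(answer.claims)


def enforce(answer: Answer, corpus: dict[str, str]) -> Answer:
    """Pass a fully grounded answer through; otherwise convert it to a refusal."""
    if answer.refused:
        return answer
    if answer.claims and all(is_grounded(c, corpus) for c in answer.claims):
        return answer
    return Answer(claims=(), refused=True)


# Placeholder corpus and answers, for illustration only.
corpus = {"doc-1": "alpha beta gamma"}
grounded = Answer(claims=(Claim("first claim", "beta gamma", "doc-1"),))
partial = Answer(claims=(
    Claim("first claim", "beta gamma", "doc-1"),
    Claim("second claim", "delta", "doc-1"),
))
```

Here `enforce(partial, corpus)` returns a refusal because one of its two claims has no matching span, which mirrors the "every assertion, or refuse" rule. A production check would likely need fuzzier span matching than exact substring comparison.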

Outcome

Metric	Before	After
Citation integrity	0.62	0.98
Audited answer rate	n/a	100%
Refusal precision	n/a	0.94

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.

"The compliance report 47hq designed is what unblocked our second health-system rollout."
— VP Engineering, Arden Health

Timeline · 7 weeks
Team · 2 senior
Scope · Fixed scope

FAQ

How engagements actually run.

How long does a typical engagement take?
Most run 4–10 weeks end-to-end. RAG and eval work tends to land in 4–6 weeks; agent or pipeline rebuilds in 6–10. Anything longer gets broken into sequential fixed-scope phases, never an open retainer.

What does 'fixed scope, named metrics' mean in practice?
Before we start we agree on a written scope, a fixed price, and 3–5 measurable outcomes (e.g. retrieval MRR ≥ 0.7, P95 latency under 800ms, eval pass rate ≥ 90%). The engagement ships when those metrics are hit, not when a calendar runs out.

Who owns the code when you're done?
Your team owns everything from day one. Code lives in your repos, infra runs on your accounts, secrets in your vault. We don't host anything for you and we don't sell licenses.

Do you do retainers or hourly work?
No. Every engagement is fixed scope with a fixed price. After ship we offer an optional 30-day reliability watch (also fixed price) for monitoring SLOs and on-call handoff. Anything beyond that is a new scoped phase.

What's the typical team size on your side?
One named senior engineer end-to-end on most engagements. A second engineer joins for code review and the reliability handoff. No project managers, no account executives, no offshore subcontractors, ever.

Can you sign an NDA before we talk?
Yes. We'll countersign your standard mutual NDA before the consultation. We don't share client names, architectures, or numbers without written permission; the case studies on this page are published with consent.

Next step

See where your AI feature stands.

Our paid consultation produces the same diagnostic format you just read — for your system.

Refundable if we're not a fit · Written diagnostic in 48 hours · Session run by a founder, not a sales rep