47HQ

Work · 3 of 12+ published

Production AI engagements, in detail.

Each case study below is detailed enough to serve as a sales-engineering deliverable on its own. Named customers, named metrics, stated methodology. No anonymous studies.

Series B · Support automation SaaS · 140 employees

Helio Support

Reduced hallucination rate from 14% to 1.8% on a 500-query golden dataset in 5 weeks.

The problem

Helio's customer-facing answer agent was hallucinating product details on 14% of queries (LLM-as-judge, sampled weekly). Their CSAT had dropped from 4.6 to 4.1 over two quarters and support leadership was about to disable the AI surface entirely.

Their existing evaluation was 60 hand-labeled prompts that a release manager ran and checked by eye. There was no CI gate, no golden dataset, and no per-component score, so they could not tell whether retrieval, the prompt, or the model upgrade had caused the regression.

What we did
  • Built a 500-query golden dataset stratified by intent and difficulty, sourced from anonymized production traffic with their data team.
  • Replaced their flat chunking with section-aware chunking + 300-token overlap. Benchmarked three strategies against the golden set; the winner improved retrieval precision@5 from 0.71 to 0.91.
  • Added a Cohere reranker stage and a citation-grounding check that drops claims not supported by retrieved context.
  • Wired our internal eval suite into their CI. Every PR runs it; hallucination_rate > 3% blocks merge.
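The CI gate in the last step can be sketched roughly as follows. This is a minimal illustration, not the internal eval suite itself: the `JudgedAnswer` record, the `hallucination_rate` helper, and the `gate` function are hypothetical names, and the per-answer `hallucinated` flag is assumed to come from an upstream LLM-as-judge step. Only the 3% threshold comes from the engagement.

```python
from dataclasses import dataclass

# From the engagement: a hallucination rate above 3% blocks the merge.
MAX_HALLUCINATION_RATE = 0.03


@dataclass(frozen=True)
class JudgedAnswer:
    """One golden-set query after judging (hypothetical record shape)."""
    query_id: str
    hallucinated: bool  # assumed output of an upstream LLM-as-judge step


def hallucination_rate(results: list[JudgedAnswer]) -> float:
    """Fraction of judged answers flagged as hallucinated."""
    if not results:
        # An empty run should fail loudly rather than pass the gate.
        raise ValueError("no judged answers in this eval run")
    return sum(r.hallucinated for r in results) / len(results)


def gate(results: list[JudgedAnswer], threshold: float = MAX_HALLUCINATION_RATE) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    rate = hallucination_rate(results)
    print(f"hallucination_rate={rate:.3f} threshold={threshold:.3f}")
    return 1 if rate > threshold else 0


# Illustrative data only: 9 of 500 answers flagged, i.e. 1.8%.
sample = [JudgedAnswer(f"q{i}", hallucinated=(i < 9)) for i in range(500)]
exit_code = gate(sample)
```

In a CI job, a script like this would typically end by passing `exit_code` to `sys.exit`, so that a non-zero code fails the pipeline step and blocks the merge.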
Outcome
Metric                    Before    After
Hallucination rate        14%       1.8%
Retrieval precision @5    0.71      0.91
Citation integrity        n/a       0.97
P95 latency               1.4s      0.64s

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.
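For readers unfamiliar with the retrieval metric reported here, the following is a minimal sketch of precision@k averaged over a golden set. It shows one common definition, which may differ from the one in the handoff doc; the function names and the toy chunk IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    # Divide by k rather than by the number returned, so short result
    # lists are not rewarded.
    return hits / k


def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average precision@k over (retrieved, relevant) pairs from a golden set."""
    if not runs:
        raise ValueError("golden set is empty")
    return sum(precision_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Toy golden set: the first query has 4 of 5 relevant hits, the second 5 of 5.
golden_runs = [
    (["a", "b", "c", "d", "e"], {"a", "b", "c", "d"}),
    (["f", "g", "h", "i", "j"], {"f", "g", "h", "i", "j"}),
]
score = mean_precision_at_k(golden_runs)
```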

"We had three other agencies pitch us. 47hq was the only one that showed us their eval methodology before they showed us a deck. After the engagement we re-ran the suite ourselves — the numbers held."

Director of Engineering, Helio Support
Timeline · 5 weeks
Team · 1 senior + 1 staff
Scope · Fixed scope

Series A · DevOps copilot · 60 employees

Lattice Field

Shipped a CLI copilot with sub-1s P95 and an eval suite that gates every release.

The problem

Lattice was about to ship a CLI copilot that suggests shell commands inside their DevOps platform. Internal dogfooding revealed unpredictable latency, and ~7% of suggestions were syntactically invalid for the user's shell. They needed reliability discipline before exposing it to paying customers.

No formal evals existed. The team was testing by hand against ~30 prompts and arguing about whether a model swap had caused a regression.

What we did
  • Designed a multi-stage retrieval architecture: per-tenant command history embedded and stored in pgvector, with a small in-context fallback ruleset for shell-syntax validation.
  • Wrote an LLM-as-judge rubric specific to shell-command correctness and built a 1,200-query golden set covering bash, zsh, and PowerShell.
  • Introduced prompt + retrieval versioning with a one-click rollback path. Shipped P95 dashboards in Grafana.
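The P95 figure tracked on those dashboards can be illustrated from raw request timings. Below is a minimal sketch using the nearest-rank percentile definition. Monitoring stacks often estimate percentiles from histogram buckets instead, so this shows what the metric means, not how Lattice's dashboards compute it. The function name and the sample timings are hypothetical.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no latency samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit float rounding error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative timings only: 100 requests from 0.40s to 1.39s in 10ms steps.
latencies_s = [0.40 + 0.01 * i for i in range(100)]
p95 = percentile_nearest_rank(latencies_s, 95)
```

With 100 samples the nearest-rank P95 is the 95th smallest value, so a handful of slow requests in the tail move the number even when the median is unchanged.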
Outcome
Metric              Before    After
Invalid commands    7.2%      0.6%
P95 end-to-end      1.8s      0.92s
Eval coverage       30        1,200

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.

"47hq treats reliability as engineering, not a deck. The runbooks they left us are still what on-call uses."

CTO, Lattice Field
Timeline · 9 weeks
Team · 1 staff + 1 senior
Scope · Fixed scope

Series B · Clinical knowledge platform · 220 employees

Arden Health

Rebuilt clinical-RAG pipeline with citation-first answers and an audited eval methodology.

The problem

Arden's clinical knowledge surface was answering provider questions, but every answer had to be auditable — which prior art it pulled from, with what confidence, and whether the citation was load-bearing or decorative. Their existing system returned answers without verifiable grounding.

Compliance was blocking expansion to two health systems pending a documented eval methodology.

What we did
  • Designed a citation-first answer schema: model must return claim → supporting span → document ID for every assertion, or refuse.
  • Built a 2,000-query golden set with clinician review of ground-truth spans.
  • Integrated our citation_integrity scoring into a weekly compliance report.
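The citation-first schema and its scoring can be sketched as below. This is an illustration under stated assumptions, not the implementation shipped to Arden: the `Citation` and `Answer` types, the exact-substring grounding check, and the definition of `citation_integrity` as the share of claims whose supporting span appears verbatim in the cited document are all hypothetical stand-ins. The corpus text is toy data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """claim -> supporting span -> document ID, as in the schema above."""
    claim: str
    supporting_span: str
    document_id: str


@dataclass(frozen=True)
class Answer:
    citations: tuple[Citation, ...] = ()
    refused: bool = False

    def __post_init__(self) -> None:
        # Schema rule: either cite every assertion or refuse, never both or neither.
        if self.refused == bool(self.citations):
            raise ValueError("answer must either carry citations or be a refusal")


def is_grounded(citation: Citation, corpus: dict[str, str]) -> bool:
    """Assumed check: the span must appear verbatim in the cited document."""
    doc = corpus.get(citation.document_id)
    return doc is not None and bool(citation.supporting_span) and citation.supporting_span in doc


def citation_integrity(answer: Answer, corpus: dict[str, str]) -> Optional[float]:
    """Share of claims with a grounded span; None for refusals (nothing to score)."""
    if answer.refused:
        return None
    grounded = sum(is_grounded(c, corpus) for c in answer.citations)
    return grounded / len(answer.citations)


# Toy corpus and answer: the first span is grounded, the second is not.
corpus = {
    "doc-1": "Recalibrate the sensor every 30 days.",
    "doc-2": "Store the kit below 25 C.",
}
answer = Answer(citations=(
    Citation("Recalibration is monthly.", "every 30 days", "doc-1"),
    Citation("Kits need refrigeration.", "keep refrigerated", "doc-2"),
))
score = citation_integrity(answer, corpus)
```

Exact substring matching is the strictest possible grounding check. A production system would likely need normalization or fuzzy span matching, which is outside the scope of this sketch.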
Outcome
Metric                 Before    After
Citation integrity     0.62      0.98
Audited answer rate    n/a       100%
Refusal precision      n/a       0.94

All metrics measured with our eval suite against the golden set defined in the engagement. Methodology in the handoff doc.

"The compliance report 47hq designed is what unblocked our second health-system rollout."

VP Engineering, Arden Health
Timeline · 7 weeks
Team · 2 senior
Scope · Fixed scope

FAQ

How engagements actually run.

How long does a typical engagement take?
Most run 4–10 weeks end-to-end. RAG and eval work tends to land in 4–6 weeks; agent or pipeline rebuilds in 6–10. Anything longer gets broken into sequential fixed-scope phases, never an open retainer.
What does 'fixed scope, named metrics' mean in practice?
Before we start we agree on a written scope, a fixed price, and 3–5 measurable outcomes (e.g. retrieval MRR ≥ 0.7, p95 latency under 800ms, eval pass rate ≥ 90%). The engagement ships when those metrics are hit — not when a calendar runs out.
Who owns the code when you're done?
Your team owns everything from day one. Code lives in your repos, infra runs on your accounts, secrets in your vault. We don't host anything for you and we don't sell licenses.
Do you do retainers or hourly work?
No. Every engagement is fixed scope with a fixed price. After ship we offer an optional 30-day reliability watch — also fixed price — for monitoring SLOs and on-call handoff. Anything beyond that is a new scoped phase.
What's the typical team size on your side?
One named senior engineer end-to-end on most engagements. A second engineer joins for code review and the reliability handoff. No project managers, no account executives, no offshore subcontractors — ever.
Can you sign an NDA before we talk?
Yes. We'll countersign your standard mutual NDA before the consultation. We don't share client names, architectures, or numbers without written permission — the case studies on this page are published with consent.

Next step

See where your AI feature stands.

Our paid consultation produces the same diagnostic format you just read — for your system.

Refundable if we're not a fit
Written diagnostic in 48 hours
Session run by a founder, not a sales rep