04 · AI Implementation

RAG & Embedding

Production-grade retrieval — measured against your golden set before it ships.

Book a Discovery Call See deliverables

rag-index · prod

LIVE

query > duplicate charge auto refund window

handbook/refund-policy.md0.94

…refund within 4 minutes…

ledger/transactions.sql0.88

…auto-reversal trigger…

runbooks/billing.md0.81

…duplicate retry guard…

Recall @5

0.96

↑ 0.18

MRR

0.71

↑ 22%

Chunk size

512

tuned

Retrieval you can measure.

47hq

Duration

3–5 weeks

Team

1 principal + 1 engineer

Starts in

Kick-off within 1–2 weeks of SOW

Investment

Fixed fee · $45k–$85k

Overview

What you get

Chunking, hybrid retrieval, reranking, citation grounding, and CI eval gates so your RAG system improves with every deploy instead of silently regressing.

The problem

Why teams call us

Hallucination rate is unknown and getting worse with each prompt tweak.
Retrieval precision regresses silently — nobody finds out until a customer does.
Cost-per-query is creeping up and there's no budget instrumentation.

Approach

How we work

Start with the eval set, not the prompt. Measure first, ship second.
Three chunking strategies benchmarked head-to-head against your data.
Hybrid (BM25 + dense) retrieval with a reranker tuned to your domain.
CI gates so regressions on hallucination or precision block the deploy.

Process

Week by week.

01 · Week 1
Eval set
Build 500+ golden query/answer set with LLM-as-judge rubrics.
02 · Week 2
Retrieval rebuild
Chunking benchmark, hybrid retrieval, reranker.
03 · Week 3–4
Grounding
Citation system, refusal logic, before/after measurement.
04 · Week 5
CI + handoff
Eval CI gates, dashboards, your-team-can-run-this runbook.

You're a fit if

Live RAG system shipped to paying customers
Willingness to instrument production with our eval harness
Single decision-maker on your side

Probably not a fit if

Pre-product 'we want to do RAG' exploration
No tolerance for instrumenting production with eval traces
Looking for a chatbot wrapper, not a measured retrieval system

Deliverables

Everything we ship

01Chunking redesign with three strategies benchmarked
02Hybrid BM25 + dense retrieval with reranker
03Citation system with source-grounding checks
04Golden eval dataset of 500+ queries with LLM-as-judge scoring
05CI gates so regressions block deploys
06Cost-per-query + latency dashboards

Outcomes

What you walk away with.

−60%

hallucination rate vs baseline

+25%

retrieval precision @5

−35%

P95 latency and cost-per-query

Named before/after metrics on hallucination rate, retrieval precision @5, P95 latency, and cost-per-query — methodology your team can re-run forever.

Tooling

Stack we ship against

Model- and infra-agnostic. We adapt to your stack, not the other way around.

OpenAIAnthropicCohere RerankPineconepgvectorTurbopufferLangSmithLangfuse

FAQ

Real questions, technically answered.

Will this work with our existing vector DB?: Yes. We are infrastructure-agnostic and ship against Pinecone, Weaviate, pgvector, Turbopuffer, and Mongo Atlas.
Do we have to switch LLM providers?: No. We benchmark across providers and recommend, but the decision (and migration) is yours.
What if our golden set is too small?: We co-author it during week 1 using real production traces. 500 queries is the floor, not the ceiling.

Related engagements

Often paired with.

02 · AI Implementation

AI Copilots

In-product assistants grounded in your customers' data — that ship, not demo.

05 · AI Implementation

Fine-Tuning & Inference

Specialised models that beat general-purpose ones on accuracy and unit cost.

09 · AI Infrastructure

Production Telemetry

When your AI gets worse, your team knows in minutes — not quarters.

Next step

Ready to scope RAG & Embedding?

Book a discovery call. We'll confirm fit, sequence the engagement, and have a Statement of Work in your inbox within a week.

Book a Discovery Call See all engagements →

Refundable if we're not a fitWritten diagnostic in 48 hoursSession run by a founder, not a sales rep