Feb 10, 2026
Hybrid BM25 + dense retrieval that actually helps
Why parent-child chunking made the biggest difference for our docs RAG, and the cross-encoder reranker tuning that followed.
Tags: ai, rag, chroma
When I first wired up dense-vector search for our docs RAG, the metrics looked great on the eval set and the answers looked wrong in production. The usual story.
What moved the needle
- Parent-child chunking. We embed small chunks (for recall) but return the parent section (for synthesis). A single parent-ID field in each chunk's metadata, and the LLM stopped losing context.
- BM25 in the mix. Dense retrieval is great for paraphrase, terrible for proper nouns and version strings. A 50/50 blend of BM25 and dense, reranked with a cross-encoder, beat either alone.
- Rerank last, and cheaply. A small cross-encoder (bge-reranker-base) over the top 20 candidates is ~30ms on CPU and lifted our top-3 accuracy by ~11pp.
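The parent-child swap is mostly bookkeeping. A minimal sketch, with hypothetical shapes (`Chunk`, a `parents` map) standing in for whatever your real store uses:

```typescript
// Hypothetical schema: children carry a parentId pointing at the section
// they were cut from. Retrieval matches children; synthesis gets parents.
type Chunk = { id: string; text: string; parentId: string };

const parents = new Map<string, string>([
  ["sec-auth", "## Authentication\nFull section text used for synthesis..."],
]);

const children: Chunk[] = [
  { id: "sec-auth#0", text: "API keys go in the Authorization header.", parentId: "sec-auth" },
];

// After retrieval returns child chunks, swap each for its parent section,
// de-duplicating so the same parent isn't sent to the LLM twice.
function toParentSections(hits: Chunk[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const hit of hits) {
    if (!seen.has(hit.parentId)) {
      seen.add(hit.parentId);
      // Fall back to the child text if the parent lookup ever misses.
      out.push(parents.get(hit.parentId) ?? hit.text);
    }
  }
  return out;
}
```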
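The 50/50 blend needs the two score lists on a comparable scale before you mix them. A sketch of one way to do it (min-max normalization, then a weighted sum; the toy score maps here stand in for real BM25 and vector-index output):

```typescript
// Squash each score list into [0, 1] so BM25's unbounded scores and
// cosine similarities can be averaged sensibly.
function normalize(scores: Map<string, number>): Map<string, number> {
  const vals = [...scores.values()];
  const min = Math.min(...vals);
  const range = Math.max(...vals) - min || 1; // avoid divide-by-zero on uniform scores
  return new Map([...scores].map(([id, s]) => [id, (s - min) / range] as [string, number]));
}

// Blend the two ranked lists; alpha = 0.5 is the 50/50 split.
// Docs found by only one retriever score 0 on the other side.
function blend(
  bm25: Map<string, number>,
  dense: Map<string, number>,
  alpha = 0.5,
): [string, number][] {
  const nb = normalize(bm25);
  const nd = normalize(dense);
  const ids = new Set([...nb.keys(), ...nd.keys()]);
  return [...ids]
    .map((id): [string, number] => [id, alpha * (nb.get(id) ?? 0) + (1 - alpha) * (nd.get(id) ?? 0)])
    .sort((a, b) => b[1] - a[1]);
}
```

The blended top-N then goes to the cross-encoder, which fixes whatever ordering mistakes the cheap fusion made.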
What didn't
- Fancy embedding fine-tuning. Not worth the ceremony at our corpus size.
- Query rewriting with the big model. Helps sometimes, costs always.
Config sketch
```js
const results = await chroma.query({
  queryTexts: [question],
  nResults: 40,
  where: { tenant: userTenant },
});
// query() returns one document list per query text, so index into the first
const reranked = await reranker.score(question, results.documents[0]);
return reranked.slice(0, 5).map(toParentChunk);
```

Nothing surprising — just the obvious thing, done carefully.