How Anthropic Reduced Retrieval Failures 67% with Two-Stage Ranking
TRIGGER
Initial retrieval from large knowledge bases returns many chunks of varying relevance, potentially hundreds. Passing all of them to the model increases cost and latency and risks distracting the model with marginally relevant content.
APPROACH
Anthropic added a reranking step after initial retrieval: (1) retrieve the top 150 potentially relevant chunks via embedding similarity plus BM25, (2) pass all 150 chunks along with the user query to a reranking model (Cohere's reranker), (3) score each chunk for relevance and keep the top 20, (4) pass only those top 20 chunks to the generative model. Combined with contextual embeddings and contextual BM25, reranking achieved a 67% reduction in retrieval failure rate (5.7% → 1.9%).
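The two-stage pipeline above can be sketched as follows. The scoring functions here are deliberately toy stand-ins: in the setup described, stage one is embedding similarity plus BM25 over the whole corpus, and stage two is a reranking model (e.g. Cohere's reranker) that reads the query and each chunk together. The function names and term-overlap scorers are illustrative assumptions, not the actual implementation.

```python
def recall_stage(query, corpus, k=150):
    """Cheap first pass over the full corpus: score every chunk by query-term
    overlap (a stand-in for embedding + BM25) and keep the top-k candidates."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(chunk.lower().split())), chunk) for chunk in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def rerank_stage(query, candidates, k=20):
    """Precision pass over the candidates only: a reranking model would read
    query and chunk together; a query-term-density scorer stands in here."""
    q_terms = query.lower().split()

    def score(chunk):
        words = chunk.lower().split()
        return sum(words.count(t) for t in q_terms) / (len(words) or 1)

    return sorted(candidates, key=score, reverse=True)[:k]


def retrieve(query, corpus, recall_k=150, final_k=20):
    """Two stages, two optimizations: recall first, then precision."""
    candidates = recall_stage(query, corpus, k=recall_k)
    return rerank_stage(query, candidates, k=final_k)
```

The key structural point survives the toy scorers: only the cheap stage ever touches the full corpus, and the expensive per-pair scoring runs over just `recall_k` candidates.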
PATTERN
“Retrieve 150 candidates cheap, rerank to top 20 with a model that actually reads query and chunk together. Two stages, two optimizations: recall first, then precision. Embeddings find related; rerankers find relevant.”
✓ WORKS WHEN
- Initial retrieval returns more candidates than you want to pass to the generative model (e.g., 150 → 20)
- Latency budget allows an additional model inference step (reranker scores chunks in parallel)
- Quality difference between top-20 and top-150 chunks is significant for your query distribution
- Reranking model cost is justified by improved response quality or reduced generative model costs
✗ FAILS WHEN
- Initial retrieval already returns high-precision results (reranking adds latency without quality gain)
- Latency requirements are extremely tight and cannot accommodate the reranking step
- Knowledge base is small enough that initial retrieval rarely includes irrelevant chunks
- Cost of reranking exceeds savings from passing fewer tokens to the generative model
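The last trade-off can be checked with back-of-envelope arithmetic: reranking pays for itself when the generative-model tokens it removes cost more than scoring the full candidate set. All prices and chunk sizes below are hypothetical placeholders for illustration, not real vendor pricing.

```python
def net_savings_per_query(n_retrieved=150, n_kept=20, tokens_per_chunk=400,
                          gen_price_per_mtok=3.00, rerank_price_per_mtok=0.10):
    """Dollar saving per query from reranking (positive = worth it).
    Prices are per million tokens and purely illustrative."""
    # Tokens the generative model no longer has to read.
    tokens_dropped = (n_retrieved - n_kept) * tokens_per_chunk
    gen_saving = tokens_dropped / 1e6 * gen_price_per_mtok
    # The reranker still reads every retrieved chunk once.
    rerank_cost = n_retrieved * tokens_per_chunk / 1e6 * rerank_price_per_mtok
    return gen_saving - rerank_cost
```

With these placeholder numbers the saving is positive; flip the price ratio (cheap generator, expensive reranker) and the same formula goes negative, which is exactly the failure mode listed above.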