Why Anthropic Combines BM25 with Semantic Search

TRIGGER

Semantic embedding search was missing relevant chunks when queries contained unique identifiers, technical terms, or exact phrases—a query for 'Error code TS-999' would find content about error codes in general but miss the exact documentation for TS-999.

APPROACH

Anthropic combined BM25 lexical matching with semantic embeddings in a hybrid retrieval system.

Input: user query.
Output: ranked chunks combining exact term matches with semantic similarity.

For each query:
(1) Use BM25 to find top chunks based on exact term matches
(2) Use embeddings to find top chunks based on semantic similarity
(3) Combine and deduplicate results using rank fusion techniques
(4) Pass top-K chunks to the model

When combined with contextual preprocessing, this hybrid approach reduced retrieval failure rate by 49% (5.7% → 2.9%) compared to embeddings alone.
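Steps (3) and (4) can be sketched with reciprocal rank fusion (RRF), one common rank fusion technique. The source does not say which fusion method Anthropic used, so treat RRF as an assumption here. The function name, the chunk IDs, the two input rankings, and the constant k=60 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into one deduplicated ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a value
    taken from the source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_k]]


# Hypothetical outputs of steps (1) and (2) for the query "Error code TS-999".
# The embedding ranking misses the exact TS-999 document entirely.
bm25_top = ["ts999_doc", "error_overview", "ts100_doc"]
embedding_top = ["error_overview", "troubleshooting", "error_handling_guide"]

fused_top = reciprocal_rank_fusion([bm25_top, embedding_top], top_k=3)
print(fused_top)  # ['error_overview', 'ts999_doc', 'troubleshooting']
```

In this toy run the chunk found by both retrievers ranks first, and the exact-match chunk that embeddings missed still reaches the top-K passed to the model. Each chunk appears once in the fused list even when both retrievers return it.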

PATTERN

“Embeddings encode meaning, not strings: "TS-999" becomes "error code in general," not "the specific TS-999 doc." BM25 catches the exact matches that embeddings miss; combined with contextual preprocessing, the two cut retrieval failures roughly in half.”
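To make the exact-match behavior concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Anthropic's implementation. The three-document corpus is made up, the whitespace tokenizer is a simplifying assumption, and k1=1.5 and b=0.75 are conventional defaults rather than values from the source.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer that keeps identifiers such as "ts-999" intact.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Rare terms (low df) get a higher inverse document frequency.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalization dampens scores for longer documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up toy corpus; the meaning given to TS-999 is purely illustrative.
docs = [
    "overview of error code categories and how to read an error code",
    "error code ts-999 means the session token expired",
    "general troubleshooting guide for connection problems",
]
scores = bm25_scores("error code ts-999", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The rare token "ts-999" occurs in only one document, so its inverse document frequency is higher than that of "error" or "code", and the document containing it scores highest even though the general overview repeats the common terms.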

✓ WORKS WHEN

Knowledge base contains technical identifiers, codes, or proper nouns that must match exactly
Query distribution includes both precise lookup queries and conceptual questions
You can afford to maintain two indexes (embedding vectors + BM25 term index)
Retrieval latency budget allows for querying both indexes and rank fusion

✗ FAILS WHEN

Queries are exclusively conceptual with no exact-match requirements (pure semantic search suffices)
Knowledge base is highly standardized with no specialized terminology (BM25 adds little value)
Storage or maintenance overhead of dual indexes is prohibitive
Query terms frequently appear in irrelevant documents (BM25 will surface false positives)

Stage

build

Source

Anthropic Engineering

From

September 2024