Anthropic's Contextual Chunk Annotation Pattern
TRIGGER
RAG systems were failing to retrieve relevant information because chunks lost context when split from their source documents. A chunk saying "revenue grew 3%" doesn't specify which company or time period, making it impossible to match against queries like "What was ACME Corp's Q2 2023 revenue growth?"
APPROACH
Anthropic's team added a preprocessing step: before embedding each chunk, they passed the full document plus the chunk to Claude Haiku, asking for a "short succinct context to situate this chunk within the overall document." Input: full document + individual chunk. Output: the chunk with 50-100 tokens of context prepended (e.g., "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023; the previous quarter's revenue was $314 million."). They applied this to both the embedding and BM25 indexes, using prompt caching to amortize the cost ($1.02 per million document tokens). Results:
- 35% reduction in top-20 retrieval failure rate with contextual embeddings alone (5.7% → 3.7%)
- 49% reduction with contextual BM25 added (5.7% → 2.9%)
- 67% reduction with reranking added (5.7% → 1.9%)
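A minimal sketch of the preprocessing step described above. The prompt wording follows the article's quoted instruction; `generate_context` is a hypothetical stand-in for the actual Claude Haiku call (a real implementation would send the formatted prompt to the API with prompt caching on the document portion, so the full document is billed once per document rather than once per chunk):

```python
# Sketch of contextual chunk annotation. The LLM call is stubbed out;
# in practice it would hit Claude Haiku with prompt caching enabled.

CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within "
    "the overall document for the purposes of improving search retrieval "
    "of the chunk. Answer only with the succinct context and nothing else."
)

def generate_context(document: str, chunk: str) -> str:
    """Hypothetical stand-in for the Claude Haiku call.

    A real implementation would send CONTEXT_PROMPT.format(document=...,
    chunk=...) to the model, caching the document prefix across chunks.
    """
    return f"This chunk is from a document beginning: {document[:40]!r}."

def annotate_chunks(document: str, chunks: list[str]) -> list[str]:
    """Prepend situating context to each chunk before indexing it in
    BOTH the embedding index and the BM25 index."""
    return [f"{generate_context(document, c)}\n\n{c}" for c in chunks]
```

Both indexes are then built over the annotated text rather than the raw chunks, so lexical and semantic matching each see the disambiguating context.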
PATTERN
"The chunk loses what 'revenue grew 3%' actually refers to (company, quarter, context) during splitting, not retrieval. Prepending 50-100 tokens of source context at index time eliminated 35-67% of top-20 retrieval failures."
✓ WORKS WHEN
- Chunks frequently reference entities or timeframes defined elsewhere in the document (financial filings, technical documentation, legal contracts)
- Knowledge base exceeds 200k tokens (below this threshold, include entire knowledge base in prompt instead)
- Documents have coherent structure where surrounding context changes meaning of individual chunks
- Prompt caching is available to amortize the cost of passing full documents repeatedly
- Documents fit in context window for annotation (article used 8k token documents with 800 token chunks)
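The 200k-token threshold above amounts to a simple routing rule. A sketch, with token counting approximated by a crude words-per-token heuristic (a real system would use the model's tokenizer; `choose_strategy` and the heuristic factor are illustrative assumptions, not from the article):

```python
def choose_strategy(knowledge_base: str, threshold_tokens: int = 200_000) -> str:
    """Routing rule from the criteria above: small knowledge bases go
    straight into the prompt (with caching); larger ones get contextual
    chunk annotation before indexing."""
    # Crude token estimate (~1.3 tokens per word); swap in the model's
    # tokenizer for real counts.
    approx_tokens = int(len(knowledge_base.split()) * 1.3)
    if approx_tokens <= threshold_tokens:
        return "prompt-stuffing"       # include the entire KB in the prompt
    return "contextual-retrieval"      # annotate chunks, index, retrieve
```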
✗ FAILS WHEN
- Chunks are already self-contained (FAQ entries, dictionary definitions, standalone articles with no cross-references)
- Knowledge base is under 200k tokens—just include the entire knowledge base in the prompt with caching
- Real-time indexing is required and LLM latency per chunk is unacceptable
- Key terms and entities are defined in other documents rather than the source document being chunked
- Documents lack coherent structure that would provide useful disambiguation context