
How Anthropic Reduced Context Pollution with Code-Routed Tool Results

TRIGGER

Multi-step agent workflows were polluting context with intermediate results the model didn't need—fetching 2,000+ expense line items to answer 'who exceeded budget?' meant all raw data entered context even though only 2-3 names mattered for the final answer. Each tool call also required a full inference pass, compounding latency.

APPROACH

Anthropic implemented Programmatic Tool Calling: instead of requesting tools one call at a time, Claude writes Python orchestration code. Tools marked with `allowed_callers: ['code_execution']` execute in a sandboxed environment, and their results flow to the script rather than into Claude's context; only the script's final output (stdout) enters context.

Input: Claude generates code like `expenses = await asyncio.gather(*[get_expenses(m['id']) for m in team])` plus filtering logic. Output: just the computed result, e.g., 1KB of budget violations instead of 200KB of raw expense data.

Results: 37% token reduction (43,588 → 27,297 on complex research tasks); knowledge retrieval improved from 25.6% to 28.5%; GIA benchmarks improved from 46.5% to 51.2%.
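The generated orchestration code might look like the following runnable sketch. The tool functions here are hypothetical local stubs standing in for sandboxed tool calls, and the team size, amounts, and budget threshold are invented for illustration:

```python
import asyncio

# Hypothetical stubs standing in for sandboxed tool calls; in the real
# system these would be tools exposed to the code-execution environment.
async def get_team_members(dept: str) -> list[dict]:
    return [{"id": i, "name": f"member-{i}"} for i in range(3)]

async def get_expenses(member_id: int) -> list[dict]:
    # Imagine hundreds of raw line items per member; none of this
    # ever enters model context.
    return [{"amount": 100 * (member_id + 1)} for _ in range(5)]

async def main() -> list[str]:
    team = await get_team_members("engineering")
    # Fan out one expense fetch per member in parallel.
    expenses = await asyncio.gather(*[get_expenses(m["id"]) for m in team])
    budget = 1200  # invented threshold for the sketch
    return [
        m["name"]
        for m, items in zip(team, expenses)
        if sum(e["amount"] for e in items) > budget
    ]

over_budget = asyncio.run(main())
print(over_budget)  # only this summary re-enters Claude's context
```

Whether the script fetches 3 members or 300, the context cost is the same: the size of the final printed list.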

PATTERN

The anti-pattern: 200KB of expense records entering context when you only need three names. The fix: route tool results through code that filters them down to conclusions. The model should see conclusions, not evidence.
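A toy illustration of that compression, with synthetic records and an invented budget (the byte counts are illustrative, not the published figures):

```python
import json

# Synthetic stand-in for raw tool output: 2,000 expense line items
# across 40 employees (numbers invented for the sketch).
records = [
    {
        "employee": f"emp-{i % 40}",
        "item": "travel",
        "amount": 600 if i % 40 >= 37 else 100,
    }
    for i in range(2000)
]
raw_bytes = len(json.dumps(records))

# Filter in code: total per employee, keep only names over budget.
budget = 20000
totals: dict[str, int] = {}
for r in records:
    totals[r["employee"]] = totals.get(r["employee"], 0) + r["amount"]
over = sorted(name for name, total in totals.items() if total > budget)
summary_bytes = len(json.dumps(over))

print(over)                             # three names
print(raw_bytes, "->", summary_bytes)   # raw size vs. what enters context
```

The raw records stay in the sandbox; only the short list of names (a few dozen bytes) would be printed back into context.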

WORKS WHEN

  • Processing datasets where only aggregates or summaries matter (not raw records)
  • Multi-step workflows with 3+ dependent tool calls
  • Intermediate results shouldn't influence the model's reasoning (e.g., raw logs, bulk records)
  • Operations can run in parallel across many items (checking 50 endpoints, fetching N user records)
  • Tool outputs are large but final answer is small (200KB → 1KB reduction pattern)
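The parallel fan-out case above (e.g., checking 50 endpoints) can be sketched with a semaphore to bound concurrency. The health checker below is a deterministic stub, and the URLs and failure set are invented:

```python
import asyncio

async def check_endpoint(idx: int, url: str) -> bool:
    # Stub for a real health-check tool call; pretend two services are down.
    await asyncio.sleep(0)
    return idx not in {7, 31}

async def main() -> list[str]:
    urls = [f"https://svc-{i}.internal/health" for i in range(50)]
    sem = asyncio.Semaphore(10)  # at most 10 checks in flight at once

    async def bounded(idx: int, url: str) -> tuple[str, bool]:
        async with sem:
            return url, await check_endpoint(idx, url)

    results = await asyncio.gather(
        *[bounded(i, u) for i, u in enumerate(urls)]
    )
    # Only the failing endpoints (the conclusion) get reported back.
    return [url for url, ok in results if not ok]

failing = asyncio.run(main())
print(failing)
```

Fifty tool results are produced and discarded in the sandbox; two URLs enter context.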

FAILS WHEN

  • Claude should see and reason about all intermediate results (debugging, auditing)
  • Simple single-tool invocations where code overhead exceeds benefit
  • Quick lookups with small responses (<1K tokens)
  • Tool results require subjective interpretation rather than programmatic filtering
  • Orchestration logic is too complex to express reliably in generated code

Stage

build

From

November 2025
