How Anthropic Reduced Context Pollution with Code-Routed Tool Results
TRIGGER
Multi-step agent workflows were polluting context with intermediate results the model didn't need: fetching 2,000+ expense line items to answer 'who exceeded budget?' pushed all the raw data into context even though only 2-3 names mattered for the final answer. Each tool call also required a full inference pass, compounding latency.
APPROACH
Anthropic implemented Programmatic Tool Calling, where Claude writes Python orchestration code instead of requesting tools one call at a time. Tools marked with `allowed_callers: ['code_execution']` execute in a sandboxed environment; their results go to the script, not Claude's context. Only the script's final output (stdout) enters context.
- Input: Claude generates code like `expenses = await asyncio.gather(*[get_expenses(m['id']) for m in team])` plus filtering logic.
- Output: just the computed result (e.g., 1KB of budget violations instead of 200KB of raw expense data).
- Results: 37% token reduction (43,588 → 27,297 tokens on complex research tasks); knowledge retrieval improved 25.6% → 28.5%; GIA benchmarks improved 46.5% → 51.2%.
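A minimal sketch of what opting a tool into code-routed execution might look like. The `allowed_callers: ['code_execution']` field is from the source; the surrounding tool-definition and request shape follows the standard Messages API tool format, and the model name and prompt here are illustrative assumptions, not the exact configuration Anthropic used:

```python
# Sketch: a tool whose results are routed to the orchestration script,
# not to the model's context. Only `allowed_callers` is from the source;
# treat the rest of the request shape as an illustrative assumption.
get_expenses_tool = {
    "name": "get_expenses",
    "description": "Return expense line items for one team member.",
    "input_schema": {
        "type": "object",
        "properties": {"member_id": {"type": "string"}},
        "required": ["member_id"],
    },
    # Marks this tool as callable only from sandboxed code, so its raw
    # output stays inside the script rather than entering context.
    "allowed_callers": ["code_execution"],
}

request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model name
    "max_tokens": 2048,
    "tools": [get_expenses_tool],
    "messages": [
        {"role": "user", "content": "Who exceeded their travel budget?"}
    ],
}
```

With this definition, the model is expected to emit an orchestration script that calls `get_expenses` itself; only that script's stdout comes back into context.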
PATTERN
“200KB of expense records entering context when you only need three names. Route tool results through code that filters to conclusions. The model should see conclusions, not evidence.”
✓ WORKS WHEN
- Processing datasets where only aggregates or summaries matter (not raw records)
- Multi-step workflows with 3+ dependent tool calls
- Intermediate results shouldn't influence reasoning (e.g., raw logs, bulk records)
- Operations can run in parallel across many items (checking 50 endpoints, fetching N user records)
- Tool outputs are large but final answer is small (200KB → 1KB reduction pattern)
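The fan-out-and-filter cases above can be sketched as the kind of orchestration script Claude would generate. Here `get_expenses` is a local stub and the budget threshold, member IDs, and amounts are invented for illustration; in the real flow the per-member results never enter the model's context, only the printed names do:

```python
import asyncio

BUDGET = 5_000  # hypothetical per-member budget


# Stub standing in for the sandboxed expense tool; amounts are made up.
async def get_expenses(member_id: str) -> list[dict]:
    fake = {
        "a1": [{"amount": 3_200}, {"amount": 2_400}],  # total 5,600: over
        "b2": [{"amount": 1_100}],                     # total 1,100: under
        "c3": [{"amount": 4_900}, {"amount": 900}],    # total 5,800: over
    }
    return fake[member_id]


async def main() -> list[str]:
    team = [
        {"id": "a1", "name": "Ana"},
        {"id": "b2", "name": "Ben"},
        {"id": "c3", "name": "Cyd"},
    ]
    # Parallel fan-out across members, mirroring the asyncio.gather call
    # quoted in APPROACH above.
    per_member = await asyncio.gather(*[get_expenses(m["id"]) for m in team])
    # Filter to conclusions: only violators' names survive to stdout.
    return [
        m["name"]
        for m, items in zip(team, per_member)
        if sum(e["amount"] for e in items) > BUDGET
    ]


if __name__ == "__main__":
    print(asyncio.run(main()))  # → ['Ana', 'Cyd']
```

Only the short printed list would enter context; the raw line items stay inside the script, which is the 200KB → 1KB reduction pattern in miniature.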
✗ FAILS WHEN
- Claude should see and reason about all intermediate results (debugging, auditing)
- Simple single-tool invocations where code overhead exceeds benefit
- Quick lookups with small responses (<1K tokens)
- Tool results require subjective interpretation rather than programmatic filtering
- Orchestration logic is too complex to express reliably in generated code