The Shared State Trap in Agent Evaluations
TRIGGER
Teams running multi-trial agent evaluations get unreliable results: correlated failures from shared infrastructure inflate failure rates, while agents that exploit artifacts left by previous trials inflate success rates. Either way, it becomes impossible to distinguish agent capability from environmental noise.
APPROACH
Anthropic's internal eval harness starts each trial from a clean environment to eliminate shared state between runs. They discovered specific contamination patterns: leftover files and cached data causing correlated failures; resource exhaustion (such as limited CPU memory) affecting multiple trials identically; and, in one case, Claude gaining an unfair advantage by examining git history from previous trials. The harness isolates each trial so that failures are independent; when they are not independent (the same environmental factor affecting multiple trials), the eval results 'become unreliable for measuring agent performance.'
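The isolation described above can be sketched as a small harness loop. This is a minimal illustration, not Anthropic's actual implementation: `run_isolated_trials` and `trial_fn` are hypothetical names, and the fixture-copy approach is one of several ways to guarantee a clean start (containers or VM snapshots are heavier-weight alternatives).

```python
# Hypothetical per-trial isolation sketch. Each trial gets a fresh copy of
# the pristine task fixture in its own temp directory, so files, caches, or
# git history written by one trial can never leak into the next.
import shutil
import tempfile
from pathlib import Path

def run_isolated_trials(fixture_dir: Path, trial_fn, n_trials: int) -> list:
    results = []
    for i in range(n_trials):
        workdir = Path(tempfile.mkdtemp(prefix=f"trial-{i}-"))
        try:
            # Fresh copy of the pristine fixture: no shared state with
            # earlier trials, including any .git directory the agent made.
            sandbox = workdir / "task"
            shutil.copytree(fixture_dir, sandbox)
            results.append(trial_fn(sandbox))
        finally:
            # Tear down everything the trial wrote, success or failure.
            shutil.rmtree(workdir, ignore_errors=True)
    return results
```

The `finally` cleanup matters as much as the fresh copy: tearing down even failed trials is what prevents the next trial from inheriting their artifacts.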
PATTERN
“Correlated failures that inflate your error rate: the 'shared state trap,' where leftover files, cached data, or even git history from previous trials contaminates results. In one trial, Claude gained an unfair advantage by reading git logs from earlier runs. When trials aren't independent, your metrics measure environment noise, not agent capability.”
✓ WORKS WHEN
- Running multiple trials per task to handle model non-determinism
- Agent can modify environment state (files, databases, git history)
- Eval harness runs trials in sequence on shared infrastructure
- Statistical reliability matters for decision-making
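When the conditions above hold, one crude smoke test for independence is to group trial outcomes by a shared factor (host, worker, time bucket) and flag groups whose failure rate far exceeds the overall rate. This is an illustrative heuristic, not a rigorous independence test; the function name and the `min_ratio` threshold are assumptions.

```python
# Hypothetical check: flag shared factors (e.g. hosts) whose failure rate
# is at least `min_ratio` times the overall failure rate. Clustered
# failures like this suggest environment noise, not agent capability.
from collections import defaultdict

def correlated_failure_groups(trials, min_ratio=2.0):
    # trials: list of (shared_factor, succeeded) pairs.
    overall_fail = sum(1 for _, ok in trials if not ok) / len(trials)
    by_factor = defaultdict(list)
    for factor, ok in trials:
        by_factor[factor].append(ok)
    flagged = {}
    for factor, oks in by_factor.items():
        fail_rate = oks.count(False) / len(oks)
        if overall_fail > 0 and fail_rate >= min_ratio * overall_fail:
            flagged[factor] = fail_rate
    return flagged
```

A flagged group is a prompt to inspect that factor for shared state before trusting the aggregate metrics, not proof of contamination on its own.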
✗ FAILS WHEN
- Single-trial evaluation where independence is irrelevant
- Agent is purely stateless (no file system or environment access)
- Isolation overhead exceeds evaluation budget
- You're deliberately testing multi-agent collaboration with shared state