The Shared State Trap in Agent Evaluations
TRIGGER
Teams running multi-trial agent evaluations get unreliable results: correlated failures from shared infrastructure inflate failure rates, while agents that exploit artifacts left by previous trials inflate success rates. Either way, it becomes impossible to distinguish agent capability from environmental noise.
APPROACH
Anthropic's internal eval harness starts each trial from a clean environment to eliminate shared state between runs. They discovered specific contamination patterns: leftover files and cached data causing correlated failures; resource exhaustion (such as limited CPU memory) affecting multiple trials identically; and, in one case, Claude gaining an unfair advantage by examining git history from previous trials. The harness isolates each trial so that failures are independent; when they are not independent (the same environmental factor affecting multiple trials), the eval results 'become unreliable for measuring agent performance.'
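The isolation described above can be sketched as a small harness loop. This is a minimal illustration, not Anthropic's actual implementation: `run_isolated_trials` and `trial_fn` are hypothetical names, and the fixture-copy approach is one of several ways to guarantee a clean start (containers or VM snapshots are heavier-weight alternatives).

```python
# Hypothetical per-trial isolation sketch. Each trial gets a fresh copy of
# the pristine task fixture in its own temp directory, so files, caches, or
# git history written by one trial can never leak into the next.
import shutil
import tempfile
from pathlib import Path

def run_isolated_trials(fixture_dir: Path, trial_fn, n_trials: int) -> list:
    results = []
    for i in range(n_trials):
        workdir = Path(tempfile.mkdtemp(prefix=f"trial-{i}-"))
        try:
            # Fresh copy of the pristine fixture: no shared state with
            # earlier trials, including any .git directory the agent made.
            sandbox = workdir / "task"
            shutil.copytree(fixture_dir, sandbox)
            results.append(trial_fn(sandbox))
        finally:
            # Tear down everything the trial wrote, success or failure.
            shutil.rmtree(workdir, ignore_errors=True)
    return results
```

The `finally` cleanup matters as much as the fresh copy: tearing down even failed trials is what prevents the next trial from inheriting their artifacts.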
PATTERN
“Correlated failures that inflate your error rate: the 'shared state trap,' where leftover files, cached data, or even git history from previous trials contaminates results. In one trial, Claude gained an unfair advantage by reading git logs from earlier runs. When trials aren't independent, your metrics measure environment noise, not agent capability.”
✓ WORKS WHEN
- Running multiple trials per task to handle model non-determinism
- Agent can modify environment state (files, databases, git history)
- Eval harness runs trials in sequence on shared infrastructure
- Statistical reliability matters for decision-making
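When the conditions above hold, one crude smoke test for independence is to group trial outcomes by a shared factor (host, worker, time bucket) and flag groups whose failure rate far exceeds the overall rate. This is an illustrative heuristic, not a rigorous independence test; the function name and the `min_ratio` threshold are assumptions.

```python
# Hypothetical check: flag shared factors (e.g. hosts) whose failure rate
# is at least `min_ratio` times the overall failure rate. Clustered
# failures like this suggest environment noise, not agent capability.
from collections import defaultdict

def correlated_failure_groups(trials, min_ratio=2.0):
    # trials: list of (shared_factor, succeeded) pairs.
    overall_fail = sum(1 for _, ok in trials if not ok) / len(trials)
    by_factor = defaultdict(list)
    for factor, ok in trials:
        by_factor[factor].append(ok)
    flagged = {}
    for factor, oks in by_factor.items():
        fail_rate = oks.count(False) / len(oks)
        if overall_fail > 0 and fail_rate >= min_ratio * overall_fail:
            flagged[factor] = fail_rate
    return flagged
```

A flagged group is a prompt to inspect that factor for shared state before trusting the aggregate metrics, not proof of contamination on its own.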
✗ FAILS WHEN
- Single-trial evaluation where independence is irrelevant
- Agent is purely stateless (no file system or environment access)
- Isolation overhead exceeds evaluation budget
- You're deliberately testing multi-agent collaboration with shared state