
The Eval Saturation Trap

TRIGGER

Teams building agent evaluation suites struggle with the lifecycle of their tests—capability evals that once measured 'can we do this at all?' become meaningless when pass rates approach 100%, while there's no systematic way to prevent backsliding on previously achieved capabilities.

APPROACH

Anthropic maintains two distinct eval types with different purposes. Capability evals start with low pass rates (targeting tasks the agent struggles with) and measure improvement—'what can this agent do well?' Regression evals have nearly 100% pass rates and detect breakage—'does the agent still handle tasks it used to?' When capability evals achieve high, stable pass rates, they 'graduate' to become regression tests run continuously in CI/CD. The team monitors for 'eval saturation' (like SWE-Bench Verified going from 30% to >80% in one year) as a signal to create harder capability evals while graduated tasks protect the baseline.
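The graduation step described above can be sketched as a small state machine. This is a minimal illustration, not Anthropic's implementation: the threshold (`GRADUATION_PASS_RATE`), window size, and the `Eval`/`graduate` names are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds -- the pattern doesn't prescribe exact numbers.
GRADUATION_PASS_RATE = 0.95   # sustained pass rate needed to graduate
GRADUATION_WINDOW = 10        # consecutive runs that must clear the bar

@dataclass
class Eval:
    name: str
    suite: str = "capability"  # "capability" or "regression"
    history: list[float] = field(default_factory=list)  # pass rate per run

    def record(self, pass_rate: float) -> None:
        self.history.append(pass_rate)

    def should_graduate(self) -> bool:
        """High, *stable* pass rates over a recent window trigger graduation."""
        recent = self.history[-GRADUATION_WINDOW:]
        return (self.suite == "capability"
                and len(recent) == GRADUATION_WINDOW
                and min(recent) >= GRADUATION_PASS_RATE)

def graduate(ev: Eval) -> None:
    """Promote a saturated capability eval into the regression suite."""
    if ev.should_graduate():
        ev.suite = "regression"  # from here on, it runs in CI/CD on every change
```

Requiring the *minimum* of the window to clear the bar, rather than the mean, is one way to encode "high, stable pass rates": a single noisy failure resets the clock.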

PATTERN

A 95% pass rate that tells you nothing: the "eval saturation trap," where tests that once measured capability now waste CI time while regressions slip through unguarded. Evals have a lifecycle: born as aspirational tests, they graduate to regression guards when pass rates stabilize, then retire when saturated.

WORKS WHEN

  • Agent development spans months with incremental improvements
  • Model upgrades happen periodically and could regress previously-working behaviors
  • Team has CI/CD infrastructure to run regression suites on each change
  • Capability evals can be cleanly separated from regression protection
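The last two conditions above amount to being able to partition the suite by purpose and run each partition on its own cadence. A minimal sketch of that partitioning, with hypothetical eval names and a toy `run` helper (any real harness or CI tagging mechanism would do):

```python
# Regression evals gate every change in CI; capability evals run on a
# slower cadence (nightly, or per model upgrade). Names are illustrative.
def run(evals, suite):
    """Run only the evals tagged with the given suite; return name -> passed."""
    results = {}
    for ev in evals:
        if ev["suite"] != suite:
            continue
        results[ev["name"]] = ev["task"]()  # task returns True on pass
    return results

evals = [
    {"name": "fix_simple_bug",  "suite": "regression", "task": lambda: True},
    {"name": "refactor_module", "suite": "capability", "task": lambda: False},
]

ci_results = run(evals, suite="regression")       # graduated tasks gate CI
nightly_results = run(evals, suite="capability")  # aspirational tasks, tracked
```

In practice the `suite` tag might live in test markers or CI job filters rather than a dict field; the point is that one flag cleanly separates "must not break" from "trying to improve."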

FAILS WHEN

  • Agent is in early prototyping where everything changes rapidly
  • Pass rates fluctuate significantly between runs due to non-determinism rather than capability changes
  • Maintaining two separate suites exceeds team capacity
  • Eval infrastructure doesn't support partitioning suites by purpose
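The non-determinism failure mode above is partly statistical: with few trials, a dip in pass rate may be noise rather than a regression. One common mitigation, sketched here with an assumed 95% normal-approximation interval (the pattern itself doesn't prescribe a method), is to flag a regression only when the observed rate's confidence interval sits entirely below the baseline:

```python
import math

def pass_rate_interval(passes: int, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a pass rate over n trials."""
    p = passes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def looks_like_regression(baseline_rate: float, passes: int, n: int) -> bool:
    """Flag only when even the interval's upper bound falls below baseline."""
    _, hi = pass_rate_interval(passes, n)
    return hi < baseline_rate
```

With 100 trials, a drop from a 0.95 baseline to 60/100 is flagged, while 92/100 is not: the second result is within sampling noise of the baseline at this sample size.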

Stage

iterate

From

January 2026
