
The Eval Saturation Trap

TRIGGER

Teams building agent evaluation suites struggle with the lifecycle of their tests—capability evals that once measured 'can we do this at all?' become meaningless when pass rates approach 100%, while there's no systematic way to prevent backsliding on previously achieved capabilities.

APPROACH

Anthropic maintains two distinct eval types with different purposes. Capability evals start with low pass rates (targeting tasks the agent struggles with) and measure improvement—'what can this agent do well?' Regression evals have nearly 100% pass rates and detect breakage—'does the agent still handle tasks it used to?' When capability evals achieve high, stable pass rates, they 'graduate' to become regression tests run continuously in CI/CD. The team monitors for 'eval saturation' (like SWE-Bench Verified going from 30% to >80% in one year) as a signal to create harder capability evals while graduated tasks protect the baseline.
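The graduation step described above can be sketched as a small state machine. This is a minimal illustration, not Anthropic's implementation: the threshold (`GRADUATION_PASS_RATE`), window size, and the `Eval`/`graduate` names are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds -- the pattern doesn't prescribe exact numbers.
GRADUATION_PASS_RATE = 0.95   # sustained pass rate needed to graduate
GRADUATION_WINDOW = 10        # consecutive runs that must clear the bar

@dataclass
class Eval:
    name: str
    suite: str = "capability"  # "capability" or "regression"
    history: list[float] = field(default_factory=list)  # pass rate per run

    def record(self, pass_rate: float) -> None:
        self.history.append(pass_rate)

    def should_graduate(self) -> bool:
        """High, *stable* pass rates over a recent window trigger graduation."""
        recent = self.history[-GRADUATION_WINDOW:]
        return (self.suite == "capability"
                and len(recent) == GRADUATION_WINDOW
                and min(recent) >= GRADUATION_PASS_RATE)

def graduate(ev: Eval) -> None:
    """Promote a saturated capability eval into the regression suite."""
    if ev.should_graduate():
        ev.suite = "regression"  # from here on, it runs in CI/CD on every change
```

Requiring the *minimum* of the window to clear the bar, rather than the mean, is one way to encode "high, stable pass rates": a single noisy failure resets the clock.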

PATTERN

A 95% pass rate that tells you nothing: the "eval saturation trap," where tests that once measured capability now waste CI time while regressions slip through unguarded. Evals have a lifecycle: born as aspirational tests, they graduate to regression guards when pass rates stabilize, then retire when saturated.

WORKS WHEN

  • Agent development spans months with incremental improvements
  • Model upgrades happen periodically and could regress previously-working behaviors
  • Team has CI/CD infrastructure to run regression suites on each change
  • Capability evals can be cleanly separated from regression protection
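The last two conditions above amount to being able to partition the suite by purpose and run each partition on its own cadence. A minimal sketch of that partitioning, with hypothetical eval names and a toy `run` helper (any real harness or CI tagging mechanism would do):

```python
# Regression evals gate every change in CI; capability evals run on a
# slower cadence (nightly, or per model upgrade). Names are illustrative.
def run(evals, suite):
    """Run only the evals tagged with the given suite; return name -> passed."""
    results = {}
    for ev in evals:
        if ev["suite"] != suite:
            continue
        results[ev["name"]] = ev["task"]()  # task returns True on pass
    return results

evals = [
    {"name": "fix_simple_bug",  "suite": "regression", "task": lambda: True},
    {"name": "refactor_module", "suite": "capability", "task": lambda: False},
]

ci_results = run(evals, suite="regression")       # graduated tasks gate CI
nightly_results = run(evals, suite="capability")  # aspirational tasks, tracked
```

In practice the `suite` tag might live in test markers or CI job filters rather than a dict field; the point is that one flag cleanly separates "must not break" from "trying to improve."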

FAILS WHEN

  • Agent is in early prototyping where everything changes rapidly
  • Pass rates fluctuate significantly between runs due to non-determinism rather than capability changes
  • Maintaining two separate suites exceeds team capacity
  • Eval infrastructure doesn't support partitioning suites by purpose
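The non-determinism failure mode above is partly statistical: with few trials, a dip in pass rate may be noise rather than a regression. One common mitigation, sketched here with an assumed 95% normal-approximation interval (the pattern itself doesn't prescribe a method), is to flag a regression only when the observed rate's confidence interval sits entirely below the baseline:

```python
import math

def pass_rate_interval(passes: int, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a pass rate over n trials."""
    p = passes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def looks_like_regression(baseline_rate: float, passes: int, n: int) -> bool:
    """Flag only when even the interval's upper bound falls below baseline."""
    _, hi = pass_rate_interval(passes, n)
    return hi < baseline_rate
```

With 100 trials, a drop from a 0.95 baseline to 60/100 is flagged, while 92/100 is not: the second result is within sampling noise of the baseline at this sample size.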

Stage

iterate

From

January 2026
