What Anthropic Learned About Eval Coverage in Early Development
TRIGGER
Teams delay building evaluations because they believe they need hundreds of comprehensive test cases before starting, while untracked regressions accumulate in their shipping agent.
APPROACH
Anthropic recommends starting with 20-50 simple tasks drawn from real failures rather than waiting for comprehensive coverage. The rationale: in early agent development, effect sizes are large, since each system change noticeably shifts behavior, so small sample sizes suffice to detect regressions. Teams source initial tasks from manual QA checks, bug trackers, and support queues; converting user-reported failures into test cases ensures the suite reflects actual usage. Claude Code started with 'fast iteration based on feedback,' then added evals 'first for narrow areas like concision and file edits, and then for more complex behaviors like over-engineering.'
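The approach above can be sketched as a minimal harness: each eval task pairs a prompt taken from a real reported failure with a cheap programmatic pass/fail check. This is an illustrative sketch, not Anthropic's implementation; the `agent` stub and the task IDs are hypothetical stand-ins for your actual system and bug tracker.

```python
# Minimal eval-suite sketch. Each task couples a real failure (prompt)
# with a cheap programmatic check on the agent's output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str                   # hypothetical: e.g. the originating bug ticket
    prompt: str
    check: Callable[[str], bool]   # pass/fail on the agent's raw output

def agent(prompt: str) -> str:
    # Stub for illustration; swap in your real agent call here.
    return "summary: done"

# In practice: 20-50 of these, each converted from a user-reported failure.
tasks = [
    EvalTask("BUG-101", "Summarize this diff in one line.",
             check=lambda out: len(out.splitlines()) == 1),
    EvalTask("BUG-117", "Reply with the summary only, no apologies.",
             check=lambda out: "sorry" not in out.lower()),
]

def run_suite(tasks):
    # Run every task once and collect failures by ID.
    results = {t.task_id: t.check(agent(t.prompt)) for t in tasks}
    failures = [tid for tid, ok in results.items() if not ok]
    return results, failures

results, failures = run_suite(tasks)
print(f"{len(tasks) - len(failures)}/{len(tasks)} passed; failing: {failures}")
```

Keeping checks programmatic (line counts, banned phrases, file diffs) makes the suite cheap enough to run on every change, which is the point: catching the regression the week it ships rather than after coverage is "complete."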
PATTERN
“Regressions ship while you wait for comprehensive test coverage. Pull 20-50 real failures from your bug tracker now. Early systems have large effect sizes; small targeted suites catch what would need hundreds of tests later.”
✓ WORKS WHEN
- Agent is in early-to-mid development with frequent significant changes
- Real user failures exist to source from (production, dogfooding, or beta)
- Team has < 100 distinct failure modes identified
- Each system change typically has visible impact on behavior
✗ FAILS WHEN
- Agent is mature and changes have small, subtle effects requiring statistical power
- No production usage exists to source real failures from
- Compliance or safety requirements mandate comprehensive coverage
- Team is optimizing against saturated capability benchmarks where every edge case matters
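The effect-size claim behind both lists can be checked with a rough power calculation. This is a generic normal-approximation sketch for a two-proportion test (not a method from the source): it scans for the smallest drop in suite pass rate detectable at a given suite size, with conventional alpha 0.05 and power 0.8.

```python
import math

def min_detectable_drop(n, p_base=0.9, z_total=2.8):
    """Smallest drop in pass rate detectable with n tasks per run.

    Normal-approximation two-proportion test; z_total = z_alpha + z_power
    (1.96 + 0.84 for two-sided alpha 0.05 at 80% power). A rough sketch,
    not a substitute for a proper power analysis.
    """
    for d in range(1, 90):
        drop = d / 100
        p_new = p_base - drop
        if p_new <= 0:
            break
        # Standard error of the difference between two pass rates.
        se = math.sqrt(p_base * (1 - p_base) / n + p_new * (1 - p_new) / n)
        if drop / se >= z_total:
            return drop
    return None

print(min_detectable_drop(30))   # small suite: only large regressions visible
print(min_detectable_drop(300))  # large suite: subtle regressions visible
```

Under these assumptions, 30 tasks reliably detect roughly a 30-point drop in pass rate, while detecting an 8-point drop takes around 300 tasks. That is the whole trade: early systems with large effect sizes are well served by a 20-50 task suite, while mature systems with subtle effects need the statistical power that only much larger suites provide.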