What Anthropic Learned About Eval Coverage in Early Development
TRIGGER
Teams delay building evaluations because they believe they need hundreds of comprehensive test cases before starting, while untracked regressions accumulate in their shipping agent.
APPROACH
Anthropic recommends starting with 20-50 simple tasks drawn from real failures rather than waiting for comprehensive coverage. The rationale: in early agent development, effect sizes are large, since each system change noticeably shifts behavior, so small sample sizes suffice to detect regressions. Teams source initial tasks from manual QA checks, bug trackers, and support queues; converting user-reported failures into test cases ensures the suite reflects actual usage. Claude Code started with 'fast iteration based on feedback,' then added evals 'first for narrow areas like concision and file edits, and then for more complex behaviors like over-engineering.'
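The approach above can be sketched as a minimal harness: each eval task pairs a prompt taken from a real reported failure with a cheap programmatic pass/fail check. This is an illustrative sketch, not Anthropic's implementation; the `agent` stub and the task IDs are hypothetical stand-ins for your actual system and bug tracker.

```python
# Minimal eval-suite sketch. Each task couples a real failure (prompt)
# with a cheap programmatic check on the agent's output.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str                   # hypothetical: e.g. the originating bug ticket
    prompt: str
    check: Callable[[str], bool]   # pass/fail on the agent's raw output

def agent(prompt: str) -> str:
    # Stub for illustration; swap in your real agent call here.
    return "summary: done"

# In practice: 20-50 of these, each converted from a user-reported failure.
tasks = [
    EvalTask("BUG-101", "Summarize this diff in one line.",
             check=lambda out: len(out.splitlines()) == 1),
    EvalTask("BUG-117", "Reply with the summary only, no apologies.",
             check=lambda out: "sorry" not in out.lower()),
]

def run_suite(tasks):
    # Run every task once and collect failures by ID.
    results = {t.task_id: t.check(agent(t.prompt)) for t in tasks}
    failures = [tid for tid, ok in results.items() if not ok]
    return results, failures

results, failures = run_suite(tasks)
print(f"{len(tasks) - len(failures)}/{len(tasks)} passed; failing: {failures}")
```

Keeping checks programmatic (line counts, banned phrases, file diffs) makes the suite cheap enough to run on every change, which is the point: catching the regression the week it ships rather than after coverage is "complete."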
PATTERN
“Regressions ship while you wait for comprehensive test coverage. Pull 20-50 real failures from your bug tracker now. Early systems have large effect sizes; small targeted suites catch what would need hundreds of tests later.”
✓ WORKS WHEN
- Agent is in early-to-mid development with frequent significant changes
- Real user failures exist to source from (production, dogfooding, or beta)
- Team has < 100 distinct failure modes identified
- Each system change typically has visible impact on behavior
✗ FAILS WHEN
- Agent is mature and changes have small, subtle effects requiring statistical power
- No production usage exists to source real failures from
- Compliance or safety requirements mandate comprehensive coverage
- Team is optimizing against saturated capability benchmarks where every edge case matters
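The effect-size claim behind both lists can be checked with a rough power calculation. This is a generic normal-approximation sketch for a two-proportion test (not a method from the source): it scans for the smallest drop in suite pass rate detectable at a given suite size, with conventional alpha 0.05 and power 0.8.

```python
import math

def min_detectable_drop(n, p_base=0.9, z_total=2.8):
    """Smallest drop in pass rate detectable with n tasks per run.

    Normal-approximation two-proportion test; z_total = z_alpha + z_power
    (1.96 + 0.84 for two-sided alpha 0.05 at 80% power). A rough sketch,
    not a substitute for a proper power analysis.
    """
    for d in range(1, 90):
        drop = d / 100
        p_new = p_base - drop
        if p_new <= 0:
            break
        # Standard error of the difference between two pass rates.
        se = math.sqrt(p_base * (1 - p_base) / n + p_new * (1 - p_new) / n)
        if drop / se >= z_total:
            return drop
    return None

print(min_detectable_drop(30))   # small suite: only large regressions visible
print(min_detectable_drop(300))  # large suite: subtle regressions visible
```

Under these assumptions, 30 tasks reliably detect roughly a 30-point drop in pass rate, while detecting an 8-point drop takes around 300 tasks. That is the whole trade: early systems with large effect sizes are well served by a 20-50 task suite, while mature systems with subtle effects need the statistical power that only much larger suites provide.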