Anthropic's Reproduce-First Approach to Code Agent Debugging
TRIGGER
Coding agents attempting to fix bugs would jump straight to modifying source code based on the issue description alone. With no way to verify that a fix actually resolved the problem, they would submit incorrect solutions or keep making changes without knowing whether they were making progress.
APPROACH
Anthropic's SWE-bench agent prompt explicitly instructs the model to create a reproduction script before attempting any fix: 'Create a script to reproduce the error and execute it with python <filename.py>... to confirm the error.' After making code changes, the model reruns the reproduction script to verify the fix. In the example shown, the model created reproduce_error.py, confirmed the TypeError, applied the fix, then reran the script to confirm the error was gone.
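A minimal sketch of what such a reproduction script might look like. The original script is not shown, so buggy_lib, parse_config, and the None argument are hypothetical stand-ins; only the file name and the TypeError come from the example:

```python
# reproduce_error.py -- hypothetical reproduction script (sketch).
# Exits 1 while the bug reproduces, 0 once the fix holds.
import sys

from buggy_lib import parse_config  # hypothetical module under repair


def main() -> int:
    try:
        # In this hypothetical issue, passing None raised a TypeError.
        parse_config(None)
    except TypeError as exc:
        print(f"Bug reproduced: {exc}")
        return 1
    print("No TypeError: the fix appears to hold.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```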
PATTERN
Without a reproduction script, your agent cannot tell "I fixed it" from "I think I fixed it." Require the agent to create and run a reproduction script before any code changes. The script is ground truth; everything else is the agent guessing.
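The requirement can also be enforced mechanically rather than left to prompting. A minimal sketch, assuming a hypothetical harness where the agent's file edits flow through an apply_edit function (the function and gate are illustrative, not Anthropic's implementation):

```python
# Hypothetical harness-side gate (sketch): refuse source edits until a
# failing reproduction script has established ground truth.
import subprocess
from pathlib import Path

REPRO = Path("reproduce_error.py")
_bug_confirmed = False  # flips True once the repro has failed at least once


def repro_fails() -> bool:
    """Run the reproduction script; True means the bug still reproduces."""
    result = subprocess.run(["python", str(REPRO)], timeout=30)
    return result.returncode != 0


def apply_edit(path: str, new_text: str) -> None:
    """Apply a proposed source edit only after the bug is confirmed."""
    global _bug_confirmed
    if not _bug_confirmed:
        if not REPRO.exists() or not repro_fails():
            raise RuntimeError(
                "Create and run a failing reproduction script before editing."
            )
        _bug_confirmed = True
    Path(path).write_text(new_text)
```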
✓ WORKS WHEN
- The issue description contains reproducible steps or error conditions
- The error manifests in a way that can be checked programmatically
- The execution environment allows running arbitrary test scripts
- Bugs are behavioral (wrong output, exceptions) rather than subtle (race conditions, memory leaks)
- The reproduction script runs quickly enough to iterate on (under 30 seconds; see the bounded-run sketch after this list)
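A sketch of bounding the verification run so a slow repro never stalls the loop. It assumes the script signals the bug with a nonzero exit code, an assumption rather than something stated in the prompt:

```python
# Bounded verification run (sketch): a timeout counts as "unverified",
# not as a pass, so the agent never claims success on a hung script.
import subprocess


def fix_verified(script: str = "reproduce_error.py", budget_s: float = 30) -> bool:
    """True if the reproduction script now passes within the time budget."""
    try:
        result = subprocess.run(["python", script], timeout=budget_s)
    except subprocess.TimeoutExpired:
        return False  # too slow to iterate on; treat as unverified
    return result.returncode == 0
```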
✗ FAILS WHEN
- Issues are about code quality, style, or architecture rather than behavior
- Reproduction requires complex environment setup the agent can't automate
- The bug is intermittent or timing-dependent and hard to trigger reliably
- Issue descriptions are vague and don't specify expected vs actual behavior
- Verification requires human judgment (UI appearance, UX quality)