Anthropic's Reproduce-First Approach to Code Agent Debugging
TRIGGER
Coding agents attempting to fix bugs would jump straight to modifying source code based on the issue description alone. With no way to verify that a fix actually resolved the problem, they would submit incorrect solutions or keep making changes without knowing whether they were making progress.
APPROACH
Anthropic's SWE-bench agent prompt explicitly instructs the model to create a reproduction script before attempting any fix: 'Create a script to reproduce the error and execute it with python <filename.py>... to confirm the error.' After making code changes, the model reruns the reproduction script to verify the fix. In the example shown, the model created reproduce_error.py, confirmed the TypeError, applied the fix, then reran the script to confirm the error was gone.
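A minimal sketch of what such a reproduction script might look like. The original script is not shown, so buggy_lib, parse_config, and the None argument are hypothetical stand-ins; only the file name and the TypeError come from the example:

```python
# reproduce_error.py -- hypothetical reproduction script (sketch).
# Exits 1 while the bug reproduces, 0 once the fix holds.
import sys

from buggy_lib import parse_config  # hypothetical module under repair


def main() -> int:
    try:
        # In this hypothetical issue, passing None raised a TypeError.
        parse_config(None)
    except TypeError as exc:
        print(f"Bug reproduced: {exc}")
        return 1
    print("No TypeError: the fix appears to hold.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```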
PATTERN
Without a reproduction script, your agent cannot tell "I fixed it" from "I think I fixed it." Require the agent to create and run a reproduction script before any code changes. The script is ground truth; everything else is the agent guessing.
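The requirement can also be enforced mechanically rather than left to prompting. A minimal sketch, assuming a hypothetical harness where the agent's file edits flow through an apply_edit function (the function and gate are illustrative, not Anthropic's implementation):

```python
# Hypothetical harness-side gate (sketch): refuse source edits until a
# failing reproduction script has established ground truth.
import subprocess
from pathlib import Path

REPRO = Path("reproduce_error.py")
_bug_confirmed = False  # flips True once the repro has failed at least once


def repro_fails() -> bool:
    """Run the reproduction script; True means the bug still reproduces."""
    result = subprocess.run(["python", str(REPRO)], timeout=30)
    return result.returncode != 0


def apply_edit(path: str, new_text: str) -> None:
    """Apply a proposed source edit only after the bug is confirmed."""
    global _bug_confirmed
    if not _bug_confirmed:
        if not REPRO.exists() or not repro_fails():
            raise RuntimeError(
                "Create and run a failing reproduction script before editing."
            )
        _bug_confirmed = True
    Path(path).write_text(new_text)
```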
✓ WORKS WHEN
- The issue description contains reproducible steps or error conditions
- The error manifests in a way that can be checked programmatically
- The execution environment allows running arbitrary test scripts
- Bugs are behavioral (wrong output, exceptions) rather than subtle (race conditions, memory leaks)
- The reproduction script runs quickly enough to iterate on (under 30 seconds; see the bounded-run sketch after this list)
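A sketch of bounding the verification run so a slow repro never stalls the loop. It assumes the script signals the bug with a nonzero exit code, an assumption rather than something stated in the prompt:

```python
# Bounded verification run (sketch): a timeout counts as "unverified",
# not as a pass, so the agent never claims success on a hung script.
import subprocess


def fix_verified(script: str = "reproduce_error.py", budget_s: float = 30) -> bool:
    """True if the reproduction script now passes within the time budget."""
    try:
        result = subprocess.run(["python", script], timeout=budget_s)
    except subprocess.TimeoutExpired:
        return False  # too slow to iterate on; treat as unverified
    return result.returncode == 0
```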
✗ FAILS WHEN
- Issues are about code quality, style, or architecture rather than behavior
- Reproduction requires complex environment setup the agent can't automate
- The bug is intermittent or timing-dependent and hard to trigger reliably
- Issue descriptions are vague and don't specify expected vs actual behavior
- Verification requires human judgment (UI appearance, UX quality)