What Anthropic Learned About Agent Scaffolding from SWE-bench
TRIGGER
Agent systems for software engineering tasks were being built with complex scaffolding that hardcoded specific workflows, limiting the model's ability to adapt its approach based on the problem at hand.
APPROACH
Anthropic built a SWE-bench agent with minimal scaffolding: a prompt outlining suggested steps, a Bash tool for executing commands, and an Edit tool for file operations. The model controls its own workflow, deciding when to explore code, create reproduction scripts, edit files, and verify fixes, rather than being forced through discrete stages. The agent loop continues until the model decides it is finished or the conversation exceeds the 200k-token context window. On SWE-bench Verified, this achieved 49% (vs. 45% for the previous SOTA; the older Claude 3.5 Sonnet scored 33%).
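The control flow described above can be sketched as a single loop in which the model, not the scaffold, picks every next step. This is an illustrative sketch, not Anthropic's actual implementation: the tool signatures, the action dict format, and the crude 4-characters-per-token budget check are all assumptions made for the example.

```python
import subprocess

CONTEXT_LIMIT = 200_000  # rough token budget, per the description above


def bash_tool(command: str) -> str:
    """Run a shell command and return its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr


def edit_tool(path: str, content: str) -> str:
    """Overwrite a file (a real edit tool would do targeted replacements)."""
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {path}"


def run_agent(model, task: str) -> list[str]:
    """Minimal scaffold: each turn, the model chooses bash, edit, or stop."""
    transcript = [task]
    # Crude token estimate (~4 chars/token) stands in for a real tokenizer.
    while sum(len(m) // 4 for m in transcript) < CONTEXT_LIMIT:
        action = model(transcript)  # the model decides the next action
        if action["type"] == "stop":  # model declares the task finished
            break
        elif action["type"] == "bash":
            observation = bash_tool(action["command"])
        elif action["type"] == "edit":
            observation = edit_tool(action["path"], action["content"])
        transcript.append(observation)
    return transcript


# Scripted stand-in for the model, for illustration only.
script = iter([
    {"type": "bash", "command": "echo hello"},
    {"type": "stop"},
])
transcript = run_agent(lambda t: next(script), "fix the failing test")
```

Note that there are no hardcoded stages: a "reproduce, then fix, then verify" sequence emerges only if the model chooses it, which is the point of the pattern.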
PATTERN
“Hardcoded workflows become a ceiling, not a floor. Anthropic's minimal-scaffold agent matched or beat complex multi-stage pipelines on SWE-bench. When the model is capable enough, let it control workflow.”
✓ WORKS WHEN
- The underlying model has strong reasoning and self-correction capabilities
- Tasks are varied enough that a single hardcoded workflow can't fit all cases
- Context window is large enough to accommodate extended exploration (200k+ tokens)
- You can afford high token costs for tenacious multi-turn problem solving (100k+ tokens per task)
- The model can recognize when it's done vs. when to keep trying
✗ FAILS WHEN
- The model lacks judgment to choose appropriate next steps and needs guardrails
- Cost per task must be tightly controlled (this approach used hundreds of turns on some tasks)
- Tasks follow predictable patterns where a hardcoded workflow would be more efficient
- Context limits are tight and exploration would exhaust the window before solving
- Latency requirements prevent multi-turn exploration