What Anthropic Learned About Agent Scaffolding from SWE-bench
TRIGGER
Agent systems for software engineering tasks were being built with complex scaffolding that hardcoded specific workflows, limiting the model's ability to adapt its approach based on the problem at hand.
APPROACH
Anthropic built a SWE-bench agent with minimal scaffolding: a prompt outlining suggested steps, a Bash tool for executing commands, and an Edit tool for file operations. The model controls its own workflow, deciding when to explore code, create reproduction scripts, edit files, and verify fixes, rather than being forced through discrete stages. The agent loop continues until the model decides it is finished or the conversation exceeds the 200k-token context window. On SWE-bench Verified, this achieved 49% (vs. 45% for the previous SOTA; the older Claude 3.5 Sonnet scored 33%).
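The control flow described above can be sketched as a single loop in which the model, not the scaffold, picks every next step. This is an illustrative sketch, not Anthropic's actual implementation: the tool signatures, the action dict format, and the crude 4-characters-per-token budget check are all assumptions made for the example.

```python
import subprocess

CONTEXT_LIMIT = 200_000  # rough token budget, per the description above


def bash_tool(command: str) -> str:
    """Run a shell command and return its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr


def edit_tool(path: str, content: str) -> str:
    """Overwrite a file (a real edit tool would do targeted replacements)."""
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {path}"


def run_agent(model, task: str) -> list[str]:
    """Minimal scaffold: each turn, the model chooses bash, edit, or stop."""
    transcript = [task]
    # Crude token estimate (~4 chars/token) stands in for a real tokenizer.
    while sum(len(m) // 4 for m in transcript) < CONTEXT_LIMIT:
        action = model(transcript)  # the model decides the next action
        if action["type"] == "stop":  # model declares the task finished
            break
        elif action["type"] == "bash":
            observation = bash_tool(action["command"])
        elif action["type"] == "edit":
            observation = edit_tool(action["path"], action["content"])
        transcript.append(observation)
    return transcript


# Scripted stand-in for the model, for illustration only.
script = iter([
    {"type": "bash", "command": "echo hello"},
    {"type": "stop"},
])
transcript = run_agent(lambda t: next(script), "fix the failing test")
```

Note that there are no hardcoded stages: a "reproduce, then fix, then verify" sequence emerges only if the model chooses it, which is the point of the pattern.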
PATTERN
“Hardcoded workflows become a ceiling, not a floor. Anthropic's minimal-scaffold agent matched or beat complex multi-stage pipelines on SWE-bench. When the model is capable enough, let it control workflow.”
✓ WORKS WHEN
- The underlying model has strong reasoning and self-correction capabilities
- Tasks are varied enough that a single hardcoded workflow can't fit all cases
- Context window is large enough to accommodate extended exploration (200k+ tokens)
- You can afford high token costs for tenacious multi-turn problem solving (100k+ tokens per task)
- The model can recognize when it's done vs. when to keep trying
✗ FAILS WHEN
- The model lacks judgment to choose appropriate next steps and needs guardrails
- Cost per task must be tightly controlled (this approach used hundreds of turns on some tasks)
- Tasks follow predictable patterns where a hardcoded workflow would be more efficient
- Context limits are tight and exploration would exhaust the window before solving
- Latency requirements prevent multi-turn exploration