
What Anthropic Learned About Agent Self-Assessment Failures

TRIGGER

Autonomous agents operating over multiple steps can drift off course or compound errors without realizing it. The agent's internal "reasoning" about what happened isn't reliable: it may hallucinate success or misread the state of the world.

APPROACH

Anthropic's coding agents and computer-use implementations receive concrete environmental feedback at each step: tool call results, code execution output, test results, screen state. The agent uses this ground truth to assess progress rather than relying on its own predictions about what should have happened. For coding agents on SWE-bench, automated tests verify functionality. Human checkpoints are added for blocking decisions or when confidence is low. Stopping conditions (max iterations) prevent runaway execution.
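A minimal sketch of this loop, assuming a toy environment and placeholder model call (the names `ToyEnv`, `propose_action`, and `run_agent` are illustrative, not Anthropic's actual API): each iteration acts, records the concrete observation, and checks a programmatic success signal, with a max-iteration cap as the stopping condition.

```python
MAX_ITERATIONS = 10  # stopping condition: prevent runaway execution

class ToyEnv:
    """Stand-in environment: a counter that 'succeeds' once it reaches 3."""
    def __init__(self):
        self.state = 0

    def execute(self, action):
        self.state += 1
        return {"ok": True, "state": self.state}  # concrete observation

    def tests_pass(self):
        return self.state >= 3  # observable success signal

def propose_action(history):
    return "increment"  # placeholder for a real model call

def run_agent(env):
    history = []
    for _ in range(MAX_ITERATIONS):
        action = propose_action(history)
        observation = env.execute(action)    # ground truth, not a prediction
        history.append((action, observation))
        if env.tests_pass():                 # verify against the environment
            return history
    raise RuntimeError("max iterations reached without verified success")

history = run_agent(ToyEnv())
```

The key design point is that the loop's exit condition reads `env.tests_pass()`, never the model's own claim that the task is done.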

PATTERN

The model's belief about what happened is less reliable than checking what actually happened. Parse tool results, capture execution output, verify file system state. Environment observation is ground truth; self-assessment is hallucination-prone.
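The three checks above can be sketched as one verification helper. This is a hedged illustration, not a prescribed implementation: the file path and test command are placeholders, and here the "test suite" is a trivial subprocess so the example stays self-contained.

```python
import pathlib
import subprocess
import sys

def verify_step(expected_file: str) -> dict:
    """Check what actually happened in the environment after an agent step."""
    # 1. File system state: does the file the agent claims to have written exist?
    file_ok = pathlib.Path(expected_file).exists()
    # 2. Execution output: run a command and read its exit code and stdout.
    #    (A real agent would invoke its test suite here.)
    result = subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        capture_output=True, text=True,
    )
    tests_ok = result.returncode == 0
    # 3. Parsed tool result: structured facts, not the model's narrative.
    return {"file_exists": file_ok, "tests_pass": tests_ok,
            "stdout": result.stdout.strip()}
```

The returned dict is what gets fed back to the model as observation, so the model reasons over parsed facts rather than its own guesses.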

WORKS WHEN

  • Environment provides observable signals of success or failure (test pass/fail, API response codes, file system state changes)
  • Tasks have clear completion criteria that can be checked programmatically
  • Cost of environment queries is low relative to cost of agent mistakes
  • Agent operates in sandboxed environment where probing state is safe
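When these conditions hold, the heterogeneous signals the list mentions can be folded into a single programmatic completion check. A small sketch, with illustrative signal names:

```python
def task_complete(signals: dict) -> bool:
    """Combine observable environment signals into one completion check.
    Keys are illustrative: tests_exit_code, api_status, files_changed."""
    return (
        signals.get("tests_exit_code") == 0            # test pass/fail
        and 200 <= signals.get("api_status", 0) < 300  # API response code
        and bool(signals.get("files_changed"))         # file system state change
    )
```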

FAILS WHEN

  • Success criteria are subjective and can't be measured from environment (creative writing quality, design aesthetics)
  • Environment feedback is delayed or unavailable (batch processing, async workflows with long latency)
  • Querying environment state is expensive or has side effects
  • Environment state is too complex to summarize in a way the agent can interpret

Stage

build

From

December 2024
