iterate

Why Anthropic Grades Outcomes Over Trajectories

TRIGGER

Teams building agent evaluations instinctively check that agents followed specific steps, such as a sequence of tool calls in the right order. This approach produces brittle tests that fail when agents find valid alternative solutions the eval designers didn't anticipate.

APPROACH

Anthropic's eval teams shifted from trajectory-based grading to outcome-based grading after finding that checks on tool call sequences made tests too brittle. The grader takes a completed agent trial as input and returns pass/fail based on end-state verification, not on the step sequence.

For a coding agent fixing an authentication bypass vulnerability, instead of requiring 'read_file → edit_file → run_tests' in order, they verify deterministic outcomes: unit tests pass and security_logs show `{event_type: "auth_blocked"}`. For conversational agents on tau-bench, they check database state: `tickets: {status: resolved}`, `refunds: {status: processed}`.

When Opus 4.5 "failed" a tau2-bench flight booking task by discovering a policy loophole that actually served the user better, it showed why outcome grading matters: the trajectory was unexpected but the outcome was superior.
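The contrast between the two grading styles can be sketched as follows. This is a minimal illustration, not Anthropic's harness: the `Trial` record, its `tool_calls` and `end_state` fields, the `unit_tests_passed` key, and both grader functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """Hypothetical record of one completed agent run."""
    tool_calls: list[str]
    end_state: dict = field(default_factory=dict)


def grade_trajectory(trial: Trial, expected: list[str]) -> bool:
    # Brittle: passes only if the agent used exactly the expected steps.
    return trial.tool_calls == expected


def grade_outcome(trial: Trial) -> bool:
    # Checks only what is true after the run, not how the agent got there.
    tests_ok = trial.end_state.get("unit_tests_passed") is True
    logs = trial.end_state.get("security_logs", [])
    blocked = any(e.get("event_type") == "auth_blocked" for e in logs)
    return tests_ok and blocked


EXPECTED = ["read_file", "edit_file", "run_tests"]

# The agent searched the codebase first, a valid path the designer did not list.
trial = Trial(
    tool_calls=["grep", "read_file", "edit_file", "run_tests"],
    end_state={
        "unit_tests_passed": True,
        "security_logs": [{"event_type": "auth_blocked"}],
    },
)

trajectory_result = grade_trajectory(trial, EXPECTED)
outcome_result = grade_outcome(trial)
print(trajectory_result, outcome_result)  # prints: False True
```

The same trial fails the sequence check and passes the end-state check, which is the brittleness the approach is meant to remove.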

PATTERN

“Your evals will penalize smarter agents that find better solutions. Encoding your expected path as the only valid path turns capability tests into conformity tests—agents that succeed differently still succeed.”

✓ WORKS WHEN

Tasks have objectively verifiable end states (tests pass, database record exists, file created with correct content)
Multiple valid approaches exist to solve the problem
Agent capabilities may exceed eval designer expectations
You're measuring capability rather than process compliance
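Two of the end states named above, a database record and a file with correct content, can be checked deterministically. The sketch below uses only the Python standard library; the `tickets` table, the `report.txt` file, and the helper names are assumptions made for illustration, and the setup code stands in for whatever environment a real agent would leave behind.

```python
import sqlite3
import tempfile
from pathlib import Path


def record_exists(conn: sqlite3.Connection, ticket_id: int, status: str) -> bool:
    # True if the ticket row exists and ended in the required status.
    row = conn.execute(
        "SELECT 1 FROM tickets WHERE id = ? AND status = ?", (ticket_id, status)
    ).fetchone()
    return row is not None


def file_has_content(path: Path, expected: str) -> bool:
    # True if the file exists and its text matches exactly.
    return path.is_file() and path.read_text(encoding="utf-8") == expected


# Simulate the state an agent run would leave behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tickets (id, status) VALUES (1, 'resolved')")

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"
    report.write_text("done\n", encoding="utf-8")
    file_ok = file_has_content(report, "done\n")
    missing_ok = file_has_content(Path(tmp) / "absent.txt", "done\n")

db_ok = record_exists(conn, 1, "resolved")
db_wrong = record_exists(conn, 1, "open")
conn.close()
```

Neither check inspects which tools the agent called, so any approach that reaches the same end state passes.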

✗ FAILS WHEN

The process itself is what you're evaluating (safety-critical sequences, compliance workflows)
Outcomes are equivalent but some paths have unacceptable costs (token usage, latency, API calls)
You need to debug failures and trajectory information is essential for understanding what went wrong
Regulatory requirements mandate specific procedural steps regardless of outcome
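Where path cost or debuggability matters, one option is to keep the outcome check as the primary verdict and layer budget guards on top, while storing the trajectory for later inspection. The sketch below assumes a hypothetical harness that reports `tokens_used` and `api_calls` per trial; the `RunStats`, `Budget`, and `grade` names and the budget values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-trial measurements collected by the harness."""
    outcome_ok: bool
    tokens_used: int
    api_calls: int
    tool_calls: list[str]


@dataclass
class Budget:
    max_tokens: int
    max_api_calls: int


def grade(stats: RunStats, budget: Budget) -> dict:
    # Collect every reason the trial should fail; an empty list means pass.
    reasons = []
    if not stats.outcome_ok:
        reasons.append("end state incorrect")
    if stats.tokens_used > budget.max_tokens:
        reasons.append("token budget exceeded")
    if stats.api_calls > budget.max_api_calls:
        reasons.append("API call budget exceeded")
    return {
        "passed": not reasons,
        "reasons": reasons,
        # Kept for debugging only; it does not affect the verdict.
        "trajectory": list(stats.tool_calls),
    }


budget = Budget(max_tokens=50_000, max_api_calls=20)
frugal = grade(RunStats(True, 12_000, 8, ["read_file", "edit_file", "run_tests"]), budget)
wasteful = grade(RunStats(True, 400_000, 8, ["read_file"] * 90), budget)
print(frugal["passed"], wasteful["passed"], wasteful["reasons"])
```

Both trials reach a correct end state, but only the first stays within budget; the second fails with the reason "token budget exceeded". The step order still plays no part in the verdict. Cases where the procedure itself is mandated would need an explicit step check instead.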

Stage

iterate

Source

Anthropic Engineering

From

January 2026