The Step-by-Step Assertion Trap

TRIGGER

Evaluating agents that modify persistent state across multi-turn conversations breaks traditional testing approaches. Each action changes the environment for subsequent steps, so later turns depend on earlier ones, and agents may take completely different valid paths to reach the same goal.

APPROACH

Anthropic shifted from turn-by-turn process validation to end-state evaluation. Instead of checking whether agents followed prescribed steps, they evaluate whether agents reached the correct final state. For complex workflows, they break evaluation into discrete checkpoints where specific state changes should have occurred. They use an LLM judge with a single prompt that grades outputs against a rubric (factual accuracy, citation accuracy, completeness, source quality, and tool efficiency) and returns scores from 0.0 to 1.0 and a pass/fail grade. This method was especially effective when test cases had clear answers.

Input: agent conversation trace + expected end state.
Output: rubric-based scores and a pass/fail determination.
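A minimal sketch of the single-prompt judge described above, assuming the judge replies in JSON. The five criteria come from the text; the prompt wording, the JSON field names, and the helpers `build_judge_prompt` and `parse_judge_reply` are hypothetical, and the actual model call is left out on purpose.

```python
import json
from dataclasses import dataclass

# The five rubric criteria named in the text; the snake_case keys are assumptions.
CRITERIA = (
    "factual_accuracy",
    "citation_accuracy",
    "completeness",
    "source_quality",
    "tool_efficiency",
)


@dataclass
class JudgeResult:
    scores: dict[str, float]  # one score in [0.0, 1.0] per criterion
    passed: bool              # the judge's overall pass/fail grade


def build_judge_prompt(trace: str, expected_end_state: str) -> str:
    """Assemble one judge prompt from the trace and expected end state (hypothetical wording)."""
    criteria = ", ".join(CRITERIA)
    return (
        "Grade the agent's final result against the expected end state.\n"
        f"Score each of these from 0.0 to 1.0: {criteria}.\n"
        'Reply with JSON only: {"scores": {...}, "pass": true or false}.\n\n'
        f"Expected end state:\n{expected_end_state}\n\n"
        f"Conversation trace:\n{trace}\n"
    )


def parse_judge_reply(reply: str) -> JudgeResult:
    """Validate the judge's JSON reply before trusting its scores."""
    data = json.loads(reply)
    scores = data["scores"]
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"judge omitted criteria: {missing}")
    clean = {}
    for name in CRITERIA:
        value = float(scores[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
        clean[name] = value
    return JudgeResult(scores=clean, passed=bool(data["pass"]))
```

Validating the reply separately from the model call keeps a malformed or out-of-range judge response from silently counting as a pass.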

PATTERN

“Constant test maintenance is the symptom of the 'step-by-step assertion trap': your evals break every time the agent finds a valid alternative path you didn't anticipate. Agent paths are non-deterministic but outcomes are verifiable. Judge the destination: did the database end up in state X? Does the output contain Y?”
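To make "judge the destination" concrete, here is a small sketch using an in-memory SQLite database. The `orders` table, the two agent paths, and the `evaluate` helper are all invented for illustration. The point is that the check reads only the final database state, so both paths pass without any step-level assertions.

```python
import sqlite3


def fresh_db() -> sqlite3.Connection:
    """Create a clean environment for each eval run (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO orders VALUES (1, 'open')")
    return conn


def path_a(conn: sqlite3.Connection) -> None:
    # One valid path: cancel the order directly.
    conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = 1")


def path_b(conn: sqlite3.Connection) -> None:
    # Another valid path: look the order up, put it on hold, then cancel it.
    conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()
    conn.execute("UPDATE orders SET status = 'on_hold' WHERE id = 1")
    conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = 1")


def end_state(conn: sqlite3.Connection) -> dict:
    """Snapshot the final state the eval cares about."""
    rows = conn.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
    return {row_id: status for row_id, status in rows}


EXPECTED = {1: "cancelled"}  # "did the database end up in state X?"


def evaluate(agent_path) -> bool:
    """Run an agent path against a fresh database and compare only the end state."""
    conn = fresh_db()
    agent_path(conn)
    return end_state(conn) == EXPECTED
```

A step-by-step assertion such as "the first statement must be an UPDATE" would fail `path_b` even though it reaches the same outcome as `path_a`.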

✓ WORKS WHEN

Task has a verifiable end state (correct answer, required state changes, expected artifacts)
Multiple valid paths exist to reach the goal
Intermediate steps are means to an end, not requirements themselves
Evaluation can be expressed as discrete checkpoints rather than continuous process monitoring
LLM judge can reliably assess correctness against a rubric
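For workflows where a single final check is too coarse, the discrete-checkpoint condition above could be sketched as follows. This assumes the harness captures a state snapshot at each named checkpoint; the `Checkpoint` type, the snapshot layout, and `evaluate_checkpoints` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Checkpoint:
    name: str                      # when the snapshot was taken, e.g. "after_booking"
    check: Callable[[dict], bool]  # predicate over the state snapshot at that point


def evaluate_checkpoints(
    snapshots: dict[str, dict],
    checkpoints: list[Checkpoint],
) -> dict[str, bool]:
    """Check each expected state change against its snapshot.

    A checkpoint with no captured snapshot counts as failed. How the agent
    got from one checkpoint to the next is deliberately not inspected.
    """
    results = {}
    for cp in checkpoints:
        snapshot = snapshots.get(cp.name)
        results[cp.name] = snapshot is not None and bool(cp.check(snapshot))
    return results


# Usage with invented data: a two-checkpoint booking workflow.
checkpoints = [
    Checkpoint("after_booking", lambda s: s.get("flight_reserved") is True),
    Checkpoint("final", lambda s: s.get("payment_status") == "captured"),
]
snapshots = {
    "after_booking": {"flight_reserved": True},
    "final": {"flight_reserved": True, "payment_status": "pending"},
}
results = evaluate_checkpoints(snapshots, checkpoints)
```

Per-checkpoint results show where a workflow went wrong without asserting on the individual turns between checkpoints.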

✗ FAILS WHEN

Process compliance matters (regulated domains, audit requirements)
Side effects of wrong paths are harmful even if end state is correct
End state is subjective or has no ground truth to compare against
Agent failures are path-dependent and diagnosing them requires turn-by-turn analysis
Task involves exploration where 'correct end state' can't be defined in advance

Stage

iterate

Source

Anthropic Engineering

From

June 2025