iterate

The Step-by-Step Assertion Trap

TRIGGER

Evaluating agents that modify persistent state across multi-turn conversations breaks traditional testing approaches. Each action changes the environment for subsequent steps, so later turns depend on earlier ones, and agents may take completely different valid paths to reach the same goal.

APPROACH

Anthropic shifted from turn-by-turn process validation to end-state evaluation. Instead of checking whether agents followed prescribed steps, they evaluate whether agents achieved the correct final state. For complex workflows, they break evaluation into discrete checkpoints where specific state changes should have occurred. They use an LLM judge with a single prompt that evaluates outputs on five criteria: factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. The judge outputs scores from 0.0 to 1.0 and a pass/fail grade. This method was especially effective when test cases had clear answers. Input: agent conversation trace + expected end state. Output: rubric-based scores and pass/fail determination.
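The judge's output can be checked mechanically before it is trusted. The sketch below is illustrative only: it assumes the judge is prompted to reply with JSON, and the field names (`scores`, `passed`), the snake_case criterion keys, and `parse_judge_reply` are hypothetical, not Anthropic's actual format. Only the five criteria and the 0.0 to 1.0 range come from the description above.

```python
import json
from dataclasses import dataclass

# The five rubric criteria named in the approach above.
CRITERIA = (
    "factual_accuracy",
    "citation_accuracy",
    "completeness",
    "source_quality",
    "tool_efficiency",
)


@dataclass
class JudgeResult:
    scores: dict[str, float]
    passed: bool


def parse_judge_reply(reply: str) -> JudgeResult:
    """Validate a judge reply against an assumed (hypothetical) JSON shape:

    {"scores": {"factual_accuracy": 0.9, ...}, "passed": true}
    """
    data = json.loads(reply)
    scores = data["scores"]

    # Every rubric criterion must be present.
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")

    # Every score must fall inside the 0.0 to 1.0 range.
    for name in CRITERIA:
        value = float(scores[name])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside 0.0-1.0")

    if not isinstance(data["passed"], bool):
        raise ValueError("'passed' must be a boolean")

    return JudgeResult(
        scores={c: float(scores[c]) for c in CRITERIA},
        passed=data["passed"],
    )
```

Rejecting malformed replies up front keeps a judge formatting error from being silently counted as an agent failure.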

PATTERN

The "step-by-step assertion trap" creates constant test maintenance: your evals break every time the agent finds a valid alternative path you didn't anticipate. Agent paths are non-deterministic, but outcomes are verifiable. Judge the destination: did the database end up in state X? Does the output contain Y?
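A minimal sketch of the difference, assuming the environment can be modeled as a plain dictionary and agent actions as `(op, key, value)` tuples. The helper names (`apply_actions`, `end_state_matches`) and the refund scenario are hypothetical, chosen only to illustrate the idea.

```python
_MISSING = object()  # sentinel so a missing key never equals an expected value


def apply_actions(state: dict, actions: list[tuple[str, str, object]]) -> dict:
    """Replay ("set" | "delete", key, value) actions on a copy of the state."""
    result = dict(state)
    for op, key, value in actions:
        if op == "set":
            result[key] = value
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return result


def end_state_matches(final_state: dict, expected: dict) -> bool:
    """Pass if every expected key holds the expected value; extra keys are ignored."""
    return all(final_state.get(k, _MISSING) == v for k, v in expected.items())


initial = {"status": "open", "balance": 40}
expected = {"status": "refunded", "balance": 0}

# Two different but equally valid paths to the same goal.
path_a = [("set", "status", "refunded"), ("set", "balance", 0)]
path_b = [
    ("set", "balance", 0),
    ("set", "note", "checked policy"),
    ("set", "status", "refunded"),
]

# A step-by-step assertion treats path_b as a failure...
step_by_step_ok = path_b == path_a

# ...while an end-state check accepts both paths.
path_a_ok = end_state_matches(apply_actions(initial, path_a), expected)
path_b_ok = end_state_matches(apply_actions(initial, path_b), expected)
```

Here `step_by_step_ok` is false even though the agent did nothing wrong, while both end-state checks pass.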

WORKS WHEN

  • Task has a verifiable end state (correct answer, required state changes, expected artifacts)
  • Multiple valid paths exist to reach the goal
  • Intermediate steps are means to an end, not requirements themselves
  • Evaluation can be expressed as discrete checkpoints rather than continuous process monitoring
  • LLM judge can reliably assess correctness against a rubric
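The checkpoint condition above can be sketched as a list of named predicates, each applied to a state snapshot captured at that point in the workflow. This is an assumption about how checkpoints might be represented, not a description of any specific harness; `evaluate_checkpoints`, the snapshot format, and the order workflow are hypothetical.

```python
from typing import Callable

# A checkpoint is a name plus a predicate over the state captured at that point.
Checkpoint = tuple[str, Callable[[dict], bool]]


def evaluate_checkpoints(
    snapshots: dict[str, dict], checkpoints: list[Checkpoint]
) -> dict[str, bool]:
    """Return pass/fail per checkpoint; a checkpoint with no snapshot fails."""
    results = {}
    for name, predicate in checkpoints:
        snapshot = snapshots.get(name)
        results[name] = snapshot is not None and bool(predicate(snapshot))
    return results


# Snapshots the harness is assumed to capture as the agent works.
snapshots = {
    "order_created": {"orders": 1, "paid": False},
    "payment_captured": {"orders": 1, "paid": True},
}

checkpoints: list[Checkpoint] = [
    ("order_created", lambda s: s.get("orders") == 1),
    ("payment_captured", lambda s: s.get("paid") is True),
    ("email_sent", lambda s: s.get("emails", 0) >= 1),
]

results = evaluate_checkpoints(snapshots, checkpoints)
```

Each predicate inspects state only, so the agent stays free to choose how it reaches each checkpoint.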

FAILS WHEN

  • Process compliance matters (regulated domains, audit requirements)
  • Side effects of wrong paths are harmful even if end state is correct
  • End state is subjective or has no ground truth to compare against
  • Agent failures are path-dependent and diagnosing them requires turn-by-turn analysis
  • Task involves exploration where 'correct end state' can't be defined in advance

Stage

iterate

From

June 2025
