
The One-Sided Evaluation Trap

TRIGGER

Teams that optimize agent behavior on one-sided evals (testing only whether the agent does X when it should) find themselves in oscillating loops: fixing undertriggering causes overtriggering, and vice versa, with no stable equilibrium.

APPROACH

When building web search evals for Claude.ai, Anthropic's team created balanced test sets covering both directions: queries where the model should search ("find the weather in Tokyo") and queries where it should answer from existing knowledge ("who founded Apple?"). Input: a candidate eval suite. Output: a balanced dataset with both "should trigger" and "should not trigger" cases. The challenge was preventing overtriggering (searching unnecessarily) while preserving the ability to do extensive research when appropriate. The team went through many rounds of refinement to both the prompts and the eval, continuously adding new examples as edge cases emerged. Without both-sided testing, optimization oscillated: fixing undertriggering caused overtriggering and vice versa, and the equilibrium only became visible once pressure was measured from both directions.
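A minimal sketch of what a two-sided trigger eval can look like. The `should_search` policy, the example queries, and the labels are all hypothetical stand-ins (the source does not describe Anthropic's actual implementation); the point is that undertrigger and overtrigger rates are reported separately, so a fix in one direction that breaks the other is immediately visible.

```python
def should_search(query: str) -> bool:
    # Hypothetical policy under test: search when the query mentions
    # time-sensitive words. Replace with the real agent's decision.
    time_sensitive = ("weather", "today", "latest", "price")
    return any(word in query.lower() for word in time_sensitive)

# Balanced dataset: cases labeled in BOTH directions.
CASES = [
    ("find the weather in Tokyo", True),       # should trigger
    ("latest AAPL stock price", True),         # should trigger
    ("who founded Apple?", False),             # answer from knowledge
    ("what is 2 + 2?", False),                 # answer from knowledge
    ("today's date in Roman history", False),  # tricky: sounds time-sensitive
]

def evaluate(cases):
    """Report undertrigger and overtrigger rates separately.

    A single aggregate accuracy hides which direction is failing;
    tracking both rates makes the oscillation visible.
    """
    missed = sum(1 for q, want in cases if want and not should_search(q))
    spurious = sum(1 for q, want in cases if not want and should_search(q))
    n_pos = sum(1 for _, want in cases if want)
    n_neg = len(cases) - n_pos
    return {
        "undertrigger_rate": missed / n_pos,
        "overtrigger_rate": spurious / n_neg,
    }

print(evaluate(CASES))
```

Running this, the naive keyword policy gets every "should trigger" case right but fires spuriously on the trivia question, so the overtrigger rate is nonzero while the undertrigger rate is zero; tightening the keyword list would push the error back the other way, which is exactly the oscillation the balanced dataset exposes.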

PATTERN

If you only test whether the agent acts when it should, you'll get an agent that acts too often. One-sided evals create one-sided optimization; the equilibrium point only becomes visible when you measure pressure from both directions.

WORKS WHEN

  • Agent must decide whether to take an action (search, escalate, use tool) vs. not taking it
  • Both false positives and false negatives have meaningful costs
  • Edge cases exist where the right decision is ambiguous
  • Team can source examples of both 'should trigger' and 'should not trigger' scenarios

FAILS WHEN

  • Action has no meaningful false positive cost (better safe than sorry scenarios)
  • Ground truth for 'should not act' cases is unclear or contentious
  • Dataset imbalance is inherent to the domain and rebalancing would misrepresent reality
  • Eval is measuring capability ceiling rather than triggering precision

Stage

iterate

From

January 2026
