Teams using simple pass/fail metrics for non-deterministic agents can't distinguish agents that occasionally succeed from agents that reliably succeed. A 75% aggregate success rate might mean "reliably solves three-quarters of the tasks" or "unpredictably fails any task one time in four," and those have very different product implications.
Source: Anthropic Engineering • January 2026
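To make the distinction concrete, here is a minimal sketch (the task IDs and trial results are invented for illustration) showing how two runs with the same 75% aggregate pass rate can have opposite per-task shapes:

```python
from statistics import mean

def per_task_pass_rates(results):
    """results: {task_id: [bool, ...]}, one bool per trial of that task."""
    return {task: mean(trials) for task, trials in results.items()}

def aggregate_pass_rate(results):
    """Pooled pass rate across all trials of all tasks."""
    return mean(r for trials in results.values() for r in trials)

# "Works 3 out of 4 times unpredictably": every task flakes at the same rate.
flaky = {f"t{i}": [True, True, True, False] for i in range(4)}

# "Usually works": three tasks always pass, one always fails.
bimodal = {f"t{i}": [True] * 4 for i in range(3)}
bimodal["t3"] = [False] * 4

# Identical aggregate score...
assert aggregate_pass_rate(flaky) == aggregate_pass_rate(bimodal) == 0.75
# ...but per-task rates reveal the difference: all 0.75 vs. all 1.0-or-0.0.
assert set(per_task_pass_rates(flaky).values()) == {0.75}
assert set(per_task_pass_rates(bimodal).values()) == {1.0, 0.0}
```

Running multiple trials per task and looking at the per-task distribution, not just the pooled rate, is what separates "reliable on a subset" from "flaky everywhere."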
Teams delay building evaluations because they believe they need hundreds of comprehensive test cases before starting, while untracked regressions accumulate in their shipping agent.
Source: Anthropic Engineering • January 2026
Single AI coding agents accumulate context bias during implementation—they become invested in their own approach and miss issues that fresh eyes would catch, similar to how human developers benefit from code review by colleagues who weren't involved in writing the code.
Source: Anthropic Engineering • April 2025
Teams running multi-trial agent evaluations get unreliable results—correlated failures from shared infrastructure issues inflate failure rates, while agents gaining unfair advantages from previous trial artifacts inflate success rates—making it impossible to distinguish agent capability from environmental noise.
Source: Anthropic Engineering • January 2026
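A common mitigation for the artifact-leakage half of this problem is to isolate every trial in a throwaway working directory. A minimal sketch, assuming the agent can be driven as a callable that takes a task and a working directory (an invented interface, not any particular framework's API):

```python
import os
import shutil
import tempfile

def run_isolated_trials(agent, task, n_trials=4):
    """Run each trial in a fresh scratch directory so artifacts left by
    one trial can't advantage (or break) the next."""
    outcomes = []
    for _ in range(n_trials):
        workdir = tempfile.mkdtemp(prefix="eval-trial-")
        try:
            outcomes.append(agent(task, workdir))
        finally:
            shutil.rmtree(workdir, ignore_errors=True)  # no state survives between trials
    return outcomes

# Demo: a "leaky" agent that would get an unfair head start if a previous
# trial's marker file were still present (hypothetical behavior).
def leaky_agent(task, workdir):
    marker = os.path.join(workdir, "done.marker")
    saw_artifact = os.path.exists(marker)
    open(marker, "w").close()
    return saw_artifact

# With isolation, no trial ever sees a prior trial's artifact.
assert run_isolated_trials(leaky_agent, task=None) == [False, False, False, False]
```

The same principle applies to databases, caches, and network fixtures: anything shared across trials is a source of correlated noise in either direction.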
Teams building agent evaluation suites struggle with the lifecycle of their tests—capability evals that once measured 'can we do this at all?' become meaningless when pass rates approach 100%, while there's no systematic way to prevent backsliding on previously achieved capabilities.
Source: Anthropic Engineering • January 2026
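One way to manage that lifecycle is to promote near-saturated capability evals into a regression suite instead of retiring them: the regression suite gates releases against backsliding, while the remaining capability evals keep measuring headroom. A sketch with an invented threshold and invented eval names:

```python
def triage_evals(pass_rates, promote_at=0.95):
    """Split evals by lifecycle stage.

    pass_rates: {eval_name: current pass rate in [0, 1]}
    Evals at or above `promote_at` become regression gates (alert on any
    drop); the rest stay capability evals (track progress toward passing).
    """
    regression, capability = [], []
    for name, rate in pass_rates.items():
        (regression if rate >= promote_at else capability).append(name)
    return regression, capability

# Hypothetical suite: one eval is effectively solved, one still has headroom.
reg, cap = triage_evals({"parse_csv": 0.99, "multi_file_refactor": 0.40})
assert reg == ["parse_csv"]
assert cap == ["multi_file_refactor"]
```

The threshold and the two-bucket split are illustrative; the point is that a saturated eval changes its job rather than losing it.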
Teams building agent evaluations instinctively check that agents followed specific steps—like a sequence of tool calls in the right order—but this approach produces brittle tests that fail when agents find valid alternative solutions the eval designers didn't anticipate.
Source: Anthropic Engineering • January 2026
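The usual fix is to assert on the outcome rather than the path. A small illustration with hypothetical tool names and final-state keys (not from any real agent framework):

```python
def eval_by_steps(trace, expected_steps):
    """Brittle: demands one exact tool-call sequence."""
    return [step["tool"] for step in trace] == expected_steps

def eval_by_outcome(final_state):
    """Robust: asserts on the end state, regardless of how it was reached."""
    return (final_state.get("user_refunded") is True
            and final_state.get("ticket_status") == "closed")

# Two valid traces that reach the same goal by different first steps.
trace_a = [{"tool": "lookup_order"}, {"tool": "issue_refund"}, {"tool": "close_ticket"}]
trace_b = [{"tool": "search_orders"}, {"tool": "issue_refund"}, {"tool": "close_ticket"}]
final = {"user_refunded": True, "ticket_status": "closed"}

expected = ["lookup_order", "issue_refund", "close_ticket"]
assert eval_by_steps(trace_a, expected)          # passes only the anticipated path
assert not eval_by_steps(trace_b, expected)      # brittle: valid alternative fails
assert eval_by_outcome(final)                    # passes for both paths
```

Step-level checks still have a place for safety-critical ordering constraints, but the default assertion should be on the state the agent was supposed to produce.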
Evaluating agents that modify persistent state across multi-turn conversations breaks traditional testing approaches—each action changes the environment for subsequent steps, creating dependencies, and agents may take completely different valid paths to reach the same goal.
Source: Anthropic Engineering • June 2025
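One workable pattern is to snapshot the environment before each step, so the evaluator can assert on intermediate and final states rather than on one canonical action sequence. A minimal sketch, with the state dict and the actions invented for illustration:

```python
import copy

def run_with_snapshots(initial_state, actions):
    """Apply each action to the environment state, deep-copying a snapshot
    before every step so assertions can target any point in the run."""
    state = copy.deepcopy(initial_state)  # never mutate the caller's fixture
    snapshots = []
    for act in actions:
        snapshots.append(copy.deepcopy(state))
        state = act(state)
    return state, snapshots

# Hypothetical actions standing in for an agent's state-changing turns.
def create_order(s):
    s["orders"] = s.get("orders", 0) + 1
    return s

def refund(s):
    s["refunds"] = s.get("refunds", 0) + 1
    return s

final, snaps = run_with_snapshots({}, [create_order, refund])
assert final == {"orders": 1, "refunds": 1}
assert snaps[0] == {} and snaps[1] == {"orders": 1}  # earlier states preserved
```

Because every run starts from a deep-copied fixture and records its own snapshots, two agents that take different valid paths can still be graded against the same intermediate and final state assertions.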
Teams optimizing agent behavior on one-sided evals—testing only whether the agent does X when it should—find themselves in oscillating loops where fixing undertriggering causes overtriggering, and vice versa, with no stable equilibrium.
Source: Anthropic Engineering • January 2026
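A stabler setup scores both directions at once: a positive set where the agent should act and a negative set where it should hold back, so a fix for one side is immediately visible as a regression on the other. A toy sketch (the always-act agent and the case lists are invented):

```python
def evaluate_two_sided(agent, positives, negatives):
    """Return (rate of acting when it should, rate of holding back when it
    shouldn't). Optimizing only the first number invites overtriggering."""
    acted_when_should = sum(bool(agent(c)) for c in positives) / len(positives)
    held_when_shouldnt = sum(not agent(c) for c in negatives) / len(negatives)
    return acted_when_should, held_when_shouldnt

# A trigger-happy agent looks perfect on a one-sided positive-only eval...
always_act = lambda case: True
pos_rate, neg_rate = evaluate_two_sided(always_act, positives=[1, 2, 3], negatives=[4, 5])
assert pos_rate == 1.0   # one-sided eval: flawless
assert neg_rate == 0.0   # ...but the negative set exposes the overtriggering
```

Tracking both numbers together (or a combined score) replaces the oscillation between undertriggering and overtriggering with a visible trade-off to tune.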
Users were getting mediocre results from AI because they carried over search-engine habits: typing vague queries and expecting useful outputs. The gap between what users asked for and what they actually needed was invisible to them.
Source: Notion • December 2025
AI confidently delivered outdated information because knowledge was scattered across tools (Slack, Google Drive, GitHub) maintained by different teams, constantly out of sync. The knowledge rot problem got worse with AI because wrong answers were served with high confidence.
Source: Notion • December 2025
Support teams needed to route incoming issues to the right team and detect duplicates, but fully automated routing risked misclassification and missed context that only humans could catch—particularly for nuanced feature requests that need aggregation with existing projects.
Source: Linear • November 2025