Why Anthropic Uses Pass@k vs Pass^k for Different Reliability Goals
TRIGGER
Teams using simple pass/fail metrics for non-deterministic agents can't distinguish agents that occasionally succeed from agents that reliably succeed. A 75% success rate might mean 'solves 75% of tasks, every time' or 'solves any given task 75% of the time, unpredictably,' and those two behaviors have very different product implications.
APPROACH
Anthropic's eval teams run multiple trials per task and compute both pass@k (the probability of at least one success in k attempts) and pass^k (the probability that all k attempts succeed), choosing between them based on product requirements. Input: the agent runs multiple trials on each task. Output: two metrics revealing different reliability characteristics. For coding agents evaluated on SWE-bench, they prioritize pass@1, since users expect first-try solutions. For customer-facing conversational agents tested on tau2-bench, they calculate pass^k with k set to the typical user session length, which reveals that a 75% per-trial success rate translates to only a (0.75)^3 ≈ 42% chance of three consecutive successes. This dual-metric approach helps teams distinguish agents that occasionally succeed from agents that reliably succeed.
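A minimal sketch of how both metrics can be estimated from trial data. The function names and the n=12/c=9 trial counts are illustrative, not Anthropic's actual code; pass@k uses the standard unbiased combinatorial estimator, while pass^k here simply exponentiates the observed per-trial rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    sampled attempts succeeds, given c successes observed in n trials."""
    if n - c < k:
        return 1.0  # fewer failures than samples: some sample must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Simple estimator of pass^k: probability that all k independent
    attempts succeed, using the observed per-trial rate c/n."""
    return (c / n) ** k

# Hypothetical results: 9 successes in 12 trials, a 75% per-trial rate.
n, c = 12, 9
print(round(pass_at_k(n, c, 3), 3))   # at least one success in 3 → 0.995
print(round(pass_hat_k(n, c, 3), 3))  # all 3 attempts succeed  → 0.422
```

At k=1 the two estimators coincide, which is why single-trial evaluation cannot surface the gap between them.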
PATTERN
“A 75% per-trial success rate means only a 42% chance of three consecutive successes; that gap is why users complain about 'unreliable' agents with 'good' metrics. Use pass@k for 'can it eventually work' and pass^k for 'will it work every time.'”
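To see why the gap in the pattern widens with session length, the two metrics have closed forms under an independence assumption: pass@k = 1 − (1 − p)^k and pass^k = p^k for per-trial success rate p. A quick tabulation (values are illustrative arithmetic, not benchmark results):

```python
# Assuming independent trials with a 75% per-trial success rate:
p = 0.75
for k in (1, 3, 5, 10):
    at_least_one = 1 - (1 - p) ** k  # pass@k: any success in k attempts
    every_time = p ** k              # pass^k: all k attempts succeed
    print(f"k={k:2d}  pass@k={at_least_one:.3f}  pass^k={every_time:.3f}")
```

pass@k climbs toward 1.0 as k grows while pass^k decays toward 0, so the same agent looks increasingly capable by one metric and increasingly unreliable by the other.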
✓ WORKS WHEN
- Agent output varies between runs due to model non-determinism
- Running multiple trials per task is feasible
- Product requirements distinguish between 'any success' and 'consistent success'
- k (number of trials) is meaningful in the product context
✗ FAILS WHEN
- Single-trial evaluation only (k=1 makes metrics equivalent)
- Tasks are deterministic so all trials produce identical results
- Neither 'eventual success' nor 'consistent success' maps to product needs
- Trial count k is arbitrary rather than reflecting actual usage patterns