Why Anthropic Uses Pass@k vs Pass^k for Different Reliability Goals
TRIGGER
Teams using simple pass/fail metrics for non-deterministic agents can't distinguish agents that occasionally succeed from agents that reliably succeed. A 75% success rate might mean 'solves 75% of tasks, every time' or 'solves any given task 75% of the time, unpredictably,' and those two behaviors have very different product implications.
APPROACH
Anthropic's eval teams run multiple trials per task and compute both pass@k (the probability of at least one success in k attempts) and pass^k (the probability that all k attempts succeed), choosing between them based on product requirements. Input: the agent runs multiple trials on each task. Output: two metrics revealing different reliability characteristics. For coding agents evaluated on SWE-bench, they prioritize pass@1, since users expect first-try solutions. For customer-facing conversational agents tested on tau2-bench, they calculate pass^k with k set to the typical user session length, which reveals that a 75% per-trial success rate translates to only a (0.75)^3 ≈ 42% chance of three consecutive successes. This dual-metric approach helps teams distinguish agents that occasionally succeed from agents that reliably succeed.
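A minimal sketch of how both metrics can be estimated from trial data. The function names and the n=12/c=9 trial counts are illustrative, not Anthropic's actual code; pass@k uses the standard unbiased combinatorial estimator, while pass^k here simply exponentiates the observed per-trial rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    sampled attempts succeeds, given c successes observed in n trials."""
    if n - c < k:
        return 1.0  # fewer failures than samples: some sample must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Simple estimator of pass^k: probability that all k independent
    attempts succeed, using the observed per-trial rate c/n."""
    return (c / n) ** k

# Hypothetical results: 9 successes in 12 trials, a 75% per-trial rate.
n, c = 12, 9
print(round(pass_at_k(n, c, 3), 3))   # at least one success in 3 → 0.995
print(round(pass_hat_k(n, c, 3), 3))  # all 3 attempts succeed  → 0.422
```

At k=1 the two estimators coincide, which is why single-trial evaluation cannot surface the gap between them.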
PATTERN
“A 75% per-trial success rate means only a 42% chance of three consecutive successes; that gap is why users complain about 'unreliable' agents with 'good' metrics. Use pass@k for 'can it eventually work' and pass^k for 'will it work every time.'”
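To see why the gap in the pattern widens with session length, the two metrics have closed forms under an independence assumption: pass@k = 1 − (1 − p)^k and pass^k = p^k for per-trial success rate p. A quick tabulation (values are illustrative arithmetic, not benchmark results):

```python
# Assuming independent trials with a 75% per-trial success rate:
p = 0.75
for k in (1, 3, 5, 10):
    at_least_one = 1 - (1 - p) ** k  # pass@k: any success in k attempts
    every_time = p ** k              # pass^k: all k attempts succeed
    print(f"k={k:2d}  pass@k={at_least_one:.3f}  pass^k={every_time:.3f}")
```

pass@k climbs toward 1.0 as k grows while pass^k decays toward 0, so the same agent looks increasingly capable by one metric and increasingly unreliable by the other.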
✓ WORKS WHEN
- Agent output varies between runs due to model non-determinism
- Running multiple trials per task is feasible
- Product requirements distinguish between 'any success' and 'consistent success'
- k (number of trials) is meaningful in the product context
✗ FAILS WHEN
- Single-trial evaluation only (k=1 makes metrics equivalent)
- Tasks are deterministic so all trials produce identical results
- Neither 'eventual success' nor 'consistent success' maps to product needs
- Trial count k is arbitrary rather than reflecting actual usage patterns