What Anthropic Learned from Weeks of Undetected Quality Degradation
TRIGGER
Infrastructure bugs degraded production quality for weeks before detection. Benchmarks, safety evaluations, and canary deployments were all in place, yet none of them captured the degradation users were reporting.
APPROACH
Anthropic discovered a critical gap in their validation process: evaluations ran on representative systems, but not on the actual production traffic paths. A routine load-balancing change on August 29 increased affected traffic from 0.8% to 16% at peak, yet evaluations didn't catch it because they weren't running on the production systems that received the changed routing. Anthropic is now implementing continuous quality evaluations on true production systems to catch issues like the context window load-balancing error, which only manifested under real production routing.
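The failure mode can be illustrated with a small simulation. This is a hypothetical sketch, not Anthropic's infrastructure: all names (`POOLS`, `route`, `eval_harness`, the pool sizes and the 16% share) are invented, and the "bug" is modeled as a pool with a truncated context window. The point is structural: an eval harness pinned to a representative pool never exercises the routing weight that sends real traffic to the misconfigured pool.

```python
import random

# Hypothetical setup: two backend pools, one with a configuration bug
# that truncates the usable context window.
POOLS = {
    "pool_a": {"context_window": 200_000},  # healthy
    "pool_b": {"context_window": 8_192},    # misconfigured
}

def route(share_b: float) -> str:
    """Production router: a weighted share of requests lands on pool_b."""
    return "pool_b" if random.random() < share_b else "pool_a"

def serve(pool: str, prompt_tokens: int) -> bool:
    """A request 'succeeds' only if the prompt fits the pool's window."""
    return prompt_tokens <= POOLS[pool]["context_window"]

def eval_harness(n: int = 1_000) -> float:
    """Evals pinned to a 'representative' pool -- they never see pool_b."""
    return sum(serve("pool_a", 50_000) for _ in range(n)) / n

def production(n: int = 1_000, share_b: float = 0.16) -> float:
    """Real traffic goes through the router, so some requests hit pool_b."""
    return sum(serve(route(share_b), 50_000) for _ in range(n)) / n

random.seed(0)
print(f"eval success rate:       {eval_harness():.2%}")  # always 100.00%
print(f"production success rate: {production():.2%}")    # well below 100%
```

Both numbers come from the same `serve` function; only the path to it differs. That is exactly why "production-equivalent" evals stay green while users see failures.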
PATTERN
“A load balancer change pushed affected traffic from 0.8% to 16%, but canary evals saw nothing for weeks. Production-equivalent is not production. Run quality evals on actual production traffic paths, because the bugs that survive are the ones that depend on production-specific routing.”
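One way to apply the pattern is to score a sampled slice of live traffic *per route*, so a regression confined to one pool surfaces even when aggregate quality looks fine. This is a minimal sketch under assumed names (`RouteQualityMonitor`, `quality_probe`, the 1% sample rate and 0.95 threshold are all illustrative); a real deployment would replace the string check with an automated grader.

```python
import random
from collections import defaultdict

SAMPLE_RATE = 0.01  # score 1% of production requests (assumed knob)

def quality_probe(response: str) -> float:
    """Stand-in scorer; a real system would run an automated quality grader."""
    return 0.0 if "ERROR" in response else 1.0

class RouteQualityMonitor:
    """Tracks sampled quality scores keyed by the route that served the request."""

    def __init__(self, alert_threshold: float = 0.95):
        self.scores = defaultdict(list)
        self.alert_threshold = alert_threshold

    def observe(self, route: str, response: str) -> None:
        # Sample in the serving path itself, so every production route
        # contributes scores in proportion to the traffic it actually gets.
        if random.random() < SAMPLE_RATE:
            self.scores[route].append(quality_probe(response))

    def degraded_routes(self) -> list[str]:
        """Routes whose sampled mean quality falls below the threshold."""
        return [r for r, s in self.scores.items()
                if s and sum(s) / len(s) < self.alert_threshold]
```

Keying scores by route is the design choice that matters: a global average would dilute a 16%-of-traffic regression into a few points of noise, while the per-route breakdown pins it to the misconfigured pool.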
✓ WORKS WHEN
- Infrastructure has multiple deployment targets with different configurations (multi-cloud, multi-hardware)
- Routing logic, load balancing, or feature flags create divergent traffic paths
- Changes can affect a subset of production traffic while leaving evaluation systems untouched
- Quality issues are intermittent or configuration-dependent rather than universal
- Canary deployments use different traffic patterns than full production
✗ FAILS WHEN
- Single homogeneous deployment target with no routing complexity
- Evaluation environment receives actual production traffic (is production)
- All changes affect all traffic equally with no subset targeting
- Bugs are deterministic and reproducible in any environment