What Anthropic Learned from Weeks of Undetected Quality Degradation
TRIGGER
Infrastructure bugs degraded production quality for weeks before detection. Benchmarks, safety evaluations, and canary deployments were all in place, yet none of them captured the degradation users were reporting.
APPROACH
Anthropic discovered a critical gap in their validation process: evaluations ran on representative systems, but not on the actual production traffic paths. A routine load-balancing change on August 29 increased affected traffic from 0.8% to 16% at peak, yet evaluations didn't catch it because they weren't running on the production systems that received the changed routing. Anthropic is now implementing continuous quality evaluations on true production systems to catch issues like the context window load-balancing error, which only manifested under real production routing.
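The failure mode can be illustrated with a small simulation. This is a hypothetical sketch, not Anthropic's infrastructure: all names (`POOLS`, `route`, `eval_harness`, the pool sizes and the 16% share) are invented, and the "bug" is modeled as a pool with a truncated context window. The point is structural: an eval harness pinned to a representative pool never exercises the routing weight that sends real traffic to the misconfigured pool.

```python
import random

# Hypothetical setup: two backend pools, one with a configuration bug
# that truncates the usable context window.
POOLS = {
    "pool_a": {"context_window": 200_000},  # healthy
    "pool_b": {"context_window": 8_192},    # misconfigured
}

def route(share_b: float) -> str:
    """Production router: a weighted share of requests lands on pool_b."""
    return "pool_b" if random.random() < share_b else "pool_a"

def serve(pool: str, prompt_tokens: int) -> bool:
    """A request 'succeeds' only if the prompt fits the pool's window."""
    return prompt_tokens <= POOLS[pool]["context_window"]

def eval_harness(n: int = 1_000) -> float:
    """Evals pinned to a 'representative' pool -- they never see pool_b."""
    return sum(serve("pool_a", 50_000) for _ in range(n)) / n

def production(n: int = 1_000, share_b: float = 0.16) -> float:
    """Real traffic goes through the router, so some requests hit pool_b."""
    return sum(serve(route(share_b), 50_000) for _ in range(n)) / n

random.seed(0)
print(f"eval success rate:       {eval_harness():.2%}")  # always 100.00%
print(f"production success rate: {production():.2%}")    # well below 100%
```

Both numbers come from the same `serve` function; only the path to it differs. That is exactly why "production-equivalent" evals stay green while users see failures.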
PATTERN
“A load balancer change pushed affected traffic from 0.8% to 16%, but canary evals saw nothing for weeks. Production-equivalent is not production. Run quality evals on actual production traffic paths, because the bugs that survive are the ones that depend on production-specific routing.”
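One way to apply the pattern is to score a sampled slice of live traffic *per route*, so a regression confined to one pool surfaces even when aggregate quality looks fine. This is a minimal sketch under assumed names (`RouteQualityMonitor`, `quality_probe`, the 1% sample rate and 0.95 threshold are all illustrative); a real deployment would replace the string check with an automated grader.

```python
import random
from collections import defaultdict

SAMPLE_RATE = 0.01  # score 1% of production requests (assumed knob)

def quality_probe(response: str) -> float:
    """Stand-in scorer; a real system would run an automated quality grader."""
    return 0.0 if "ERROR" in response else 1.0

class RouteQualityMonitor:
    """Tracks sampled quality scores keyed by the route that served the request."""

    def __init__(self, alert_threshold: float = 0.95):
        self.scores = defaultdict(list)
        self.alert_threshold = alert_threshold

    def observe(self, route: str, response: str) -> None:
        # Sample in the serving path itself, so every production route
        # contributes scores in proportion to the traffic it actually gets.
        if random.random() < SAMPLE_RATE:
            self.scores[route].append(quality_probe(response))

    def degraded_routes(self) -> list[str]:
        """Routes whose sampled mean quality falls below the threshold."""
        return [r for r, s in self.scores.items()
                if s and sum(s) / len(s) < self.alert_threshold]
```

Keying scores by route is the design choice that matters: a global average would dilute a 16%-of-traffic regression into a few points of noise, while the per-route breakdown pins it to the misconfigured pool.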
✓ WORKS WHEN
- Infrastructure has multiple deployment targets with different configurations (multi-cloud, multi-hardware)
- Routing logic, load balancing, or feature flags create divergent traffic paths
- Changes can affect a subset of production traffic while leaving evaluation systems untouched
- Quality issues are intermittent or configuration-dependent rather than universal
- Canary deployments use different traffic patterns than full production
✗ FAILS WHEN
- Single homogeneous deployment target with no routing complexity
- Evaluation environment receives actual production traffic (is production)
- All changes affect all traffic equally with no subset targeting
- Bugs are deterministic and reproducible in any environment