
The Precision Disagreement Trap in Distributed Inference

TRIGGER

Token generation was occasionally dropping the highest-probability token entirely, causing nonsensical outputs (e.g., Thai characters appearing in English responses), but the bug was inconsistent—the same prompt might work perfectly on one request and fail on the next.

APPROACH

Anthropic traced the issue to mixed-precision arithmetic in their distributed top-k sampling. Models compute next-token probabilities in bf16 (16-bit), but the TPU's vector processor is fp32-native, so the XLA compiler promotes some operations to fp32 via the `xla_allow_excess_precision` flag. Operations that needed to agree on the highest-probability token were therefore running at different precision levels, and their rankings diverged: the highest-probability token sometimes disappeared from consideration entirely. Resolution: Anthropic switched from approximate to exact top-k and standardized additional operations on fp32 precision, accepting the minor efficiency cost because 'model quality is non-negotiable.'
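To see how a precision mismatch can change a ranking, here is a minimal, self-contained sketch. It uses plain NumPy and simulates bf16 by truncating float32 mantissas; the logit values are invented for illustration, not taken from the incident:

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 values to bfloat16 precision (keep the top 16 bits).

    Simple truncation rather than round-to-nearest-even; it is enough to
    show how close values collapse together at lower precision.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

# Two logits that are distinct in fp32 but collide after bf16 truncation.
logits_fp32 = np.array([10.000001, 10.000000, 1.0], dtype=np.float32)
logits_bf16 = to_bf16(logits_fp32)

# An fp32 ranking and a bf16 ranking can disagree on the top token:
top_fp32 = int(np.argmax(logits_fp32))                          # token 0 wins at fp32
top_bf16_candidates = np.flatnonzero(logits_bf16 == logits_bf16.max())  # 0 and 1 tie at bf16

print(top_fp32, list(top_bf16_candidates))
```

Once two components see different precisions of the same logits, which token "wins" depends on which component you ask, which is exactly the consensus failure described above.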

PATTERN

Thai characters appearing in English responses, highest-probability tokens vanishing—the "precision disagreement trap" where bf16 and fp32 operations on different chips disagree about rankings. Compiler precision "optimizations" become correctness bugs when operations require consensus, not approximate equivalence. Trace precision levels when debugging intermittent quality issues in multi-chip inference.
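The cross-chip version of the trap can be sketched as a hypothetical two-shard top-1 reduction, where one shard's local max was compiled to run at bf16 and the other's at fp32 (shard layout, values, and variable names are all invented for illustration):

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 to bfloat16 precision (keep the top 16 bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

# Each shard holds part of the vocabulary and reports its local winner.
shard_a = np.array([10.000001, 1.0], dtype=np.float32)  # true global max lives here
shard_b = np.array([10.000000, 2.0], dtype=np.float32)

# Shard A's local reduction ran at bf16; shard B's ran at fp32.
val_a, idx_a = to_bf16(shard_a).max(), int(np.argmax(shard_a))
val_b, idx_b = shard_b.max(), int(np.argmax(shard_b))

# The reducer compares the reported values. Shard A's value lost its low
# mantissa bits, so the true winner (global index 0) loses the tie-break
# and vanishes from consideration.
winner = idx_a if val_a > val_b else idx_b + len(shard_a)
print(winner)
```

No single step is wrong in isolation; the bug only exists in the disagreement between steps, which is why it surfaces as intermittent quality degradation rather than a crash.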

WORKS WHEN

  • Model inference is distributed across multiple chips (TPUs, GPUs, Trainium) requiring coordination
  • Operations involve ranking, sorting, or comparisons that require agreement across distributed components
  • Compiler applies automatic precision optimization (e.g., bf16 to fp32 promotion)
  • Bug manifests as intermittent quality degradation rather than crashes or obvious errors
  • Symptoms change based on batch size, model configuration, or surrounding operations

FAILS WHEN

  • Inference runs on single chip with no distributed coordination needed
  • All operations explicitly pinned to same precision level with no compiler optimization
  • Operations are purely parallel with no need for cross-component agreement
  • Outputs are obviously broken (would be caught by basic validation)
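When pinning everything to one precision is an option, one way to attempt it is via the XLA flag named in the approach section. The snippet below is a hedged sketch, not a verified fix: it assumes the flag is honored when present in `XLA_FLAGS`, and it must run before any XLA-backed framework (e.g. JAX) is imported:

```python
import os

# Assumption: appending the flag from the incident write-up to XLA_FLAGS
# disables the compiler's excess-precision promotion. Set this before
# importing the framework, or the flag may be ignored.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "") + " --xla_allow_excess_precision=false"
).strip()

# Alternative: pin the sampling-critical math explicitly in model code,
# e.g. cast logits with logits.astype("float32") before any top-k,
# so rankings are computed at one precision regardless of compiler choices.
```

The explicit cast is the more robust of the two, since it does not depend on compiler flag behavior staying stable across versions.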

Stage

build

From

September 2025
