
The Precision Disagreement Trap in Distributed Inference

TRIGGER

Token generation was occasionally dropping the highest-probability token entirely, causing nonsensical outputs (e.g., Thai characters appearing in English responses), but the bug was inconsistent—the same prompt might work perfectly on one request and fail on the next.

APPROACH

Anthropic traced the issue to mixed-precision arithmetic in their distributed top-k sampling. Models compute next-token probabilities in bf16 (16-bit), but the TPU's vector processor is fp32-native, so the XLA compiler promotes some operations to fp32 via the `xla_allow_excess_precision` flag. Operations that needed to agree on the highest-probability token were therefore running at different precision levels, and their rankings diverged: the highest-probability token sometimes disappeared from consideration entirely. Resolution: Anthropic switched from approximate to exact top-k and standardized additional operations on fp32 precision, accepting the minor efficiency cost because 'model quality is non-negotiable.'
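To see how a precision mismatch can change a ranking, here is a minimal, self-contained sketch. It uses plain NumPy and simulates bf16 by truncating float32 mantissas; the logit values are invented for illustration, not taken from the incident:

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 values to bfloat16 precision (keep the top 16 bits).

    Simple truncation rather than round-to-nearest-even; it is enough to
    show how close values collapse together at lower precision.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

# Two logits that are distinct in fp32 but collide after bf16 truncation.
logits_fp32 = np.array([10.000001, 10.000000, 1.0], dtype=np.float32)
logits_bf16 = to_bf16(logits_fp32)

# An fp32 ranking and a bf16 ranking can disagree on the top token:
top_fp32 = int(np.argmax(logits_fp32))                          # token 0 wins at fp32
top_bf16_candidates = np.flatnonzero(logits_bf16 == logits_bf16.max())  # 0 and 1 tie at bf16

print(top_fp32, list(top_bf16_candidates))
```

Once two components see different precisions of the same logits, which token "wins" depends on which component you ask, which is exactly the consensus failure described above.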

PATTERN

Thai characters appearing in English responses, highest-probability tokens vanishing—the "precision disagreement trap" where bf16 and fp32 operations on different chips disagree about rankings. Compiler precision "optimizations" become correctness bugs when operations require consensus, not approximate equivalence. Trace precision levels when debugging intermittent quality issues in multi-chip inference.
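The cross-chip version of the trap can be sketched as a hypothetical two-shard top-1 reduction, where one shard's local max was compiled to run at bf16 and the other's at fp32 (shard layout, values, and variable names are all invented for illustration):

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 to bfloat16 precision (keep the top 16 bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

# Each shard holds part of the vocabulary and reports its local winner.
shard_a = np.array([10.000001, 1.0], dtype=np.float32)  # true global max lives here
shard_b = np.array([10.000000, 2.0], dtype=np.float32)

# Shard A's local reduction ran at bf16; shard B's ran at fp32.
val_a, idx_a = to_bf16(shard_a).max(), int(np.argmax(shard_a))
val_b, idx_b = shard_b.max(), int(np.argmax(shard_b))

# The reducer compares the reported values. Shard A's value lost its low
# mantissa bits, so the true winner (global index 0) loses the tie-break
# and vanishes from consideration.
winner = idx_a if val_a > val_b else idx_b + len(shard_a)
print(winner)
```

No single step is wrong in isolation; the bug only exists in the disagreement between steps, which is why it surfaces as intermittent quality degradation rather than a crash.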

WORKS WHEN

  • Model inference is distributed across multiple chips (TPUs, GPUs, Trainium) requiring coordination
  • Operations involve ranking, sorting, or comparisons that require agreement across distributed components
  • Compiler applies automatic precision optimization (e.g., bf16 to fp32 promotion)
  • Bug manifests as intermittent quality degradation rather than crashes or obvious errors
  • Symptoms change based on batch size, model configuration, or surrounding operations

FAILS WHEN

  • Inference runs on single chip with no distributed coordination needed
  • All operations explicitly pinned to same precision level with no compiler optimization
  • Operations are purely parallel with no need for cross-component agreement
  • Outputs are obviously broken (would be caught by basic validation)
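When pinning everything to one precision is an option, one way to attempt it is via the XLA flag named in the approach section. The snippet below is a hedged sketch, not a verified fix: it assumes the flag is honored when present in `XLA_FLAGS`, and it must run before any XLA-backed framework (e.g. JAX) is imported:

```python
import os

# Assumption: appending the flag from the incident write-up to XLA_FLAGS
# disables the compiler's excess-precision promotion. Set this before
# importing the framework, or the flag may be ignored.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "") + " --xla_allow_excess_precision=false"
).strip()

# Alternative: pin the sampling-critical math explicitly in model code,
# e.g. cast logits with logits.astype("float32") before any top-k,
# so rankings are computed at one precision regardless of compiler choices.
```

The explicit cast is the more robust of the two, since it does not depend on compiler flag behavior staying stable across versions.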

Stage

build

From

September 2025
