
Why HuggingFace Separates Format Compliance from Correctness

TRIGGER

Training VLMs with RL using only accuracy-based rewards produces models that sometimes reach the correct answer but emit it in a malformed or unparseable format, making the responses unusable in downstream pipelines.

APPROACH

TRL's GRPO implementation uses two separate reward functions. format_reward checks whether the output matches a regex pattern (e.g., '^<think>.*?</think>\s*<answer>.*?</answer>$'), returning 1.0 or 0.0; accuracy_reward uses the math_verify library to parse LaTeX answers and verify them against ground truth. Input: generated completions plus ground-truth solutions. Output: two scalar rewards per completion, combined during policy optimization. Both functions are passed as a list to GRPOTrainer, allowing the policy to learn structure and accuracy as independent skills.
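A minimal sketch of the two reward functions described above. The regex is the one quoted in this pattern; the function names follow TRL's GRPO examples, though exact signatures in TRL may differ. The real accuracy_reward uses math_verify to parse and compare LaTeX; a plain string comparison stands in for it here.

```python
import re

# Regex from the pattern description: a <think> block followed by an <answer> block.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completions, **kwargs):
    """1.0 if the completion matches the <think>/<answer> template, else 0.0."""
    return [1.0 if THINK_ANSWER_RE.match(c) else 0.0 for c in completions]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 if the extracted answer matches ground truth, else 0.0.

    Simplified stand-in: the actual implementation parses and verifies
    LaTeX with the math_verify library instead of comparing raw strings.
    """
    rewards = []
    for completion, gt in zip(completions, solution):
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        pred = m.group(1).strip() if m else ""
        rewards.append(1.0 if pred == gt.strip() else 0.0)
    return rewards
```

Both callables would then be handed to the trainer as a list, along the lines of `GRPOTrainer(model, reward_funcs=[format_reward, accuracy_reward], ...)`, so each completion receives both scores during optimization.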

PATTERN

A single reward function makes it impossible to diagnose whether failures are format or reasoning problems. Separate format compliance from correctness—the policy learns structure and accuracy as independent skills instead of conflating them.
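The diagnostic benefit can be made concrete: with two reward streams, failures split cleanly into "wrong structure" versus "right structure, wrong reasoning". A hypothetical helper (not part of TRL) illustrating that split:

```python
def diagnose(format_rewards, accuracy_rewards):
    """Split failures into format problems vs reasoning problems.

    Illustrative helper, not a TRL API. Takes the per-completion
    outputs of the two reward functions and returns failure rates.
    """
    n = len(format_rewards)
    format_failures = sum(1 for f in format_rewards if f == 0.0)
    # Reasoning failures: structure was fine, but the answer was wrong.
    reasoning_failures = sum(
        1 for f, a in zip(format_rewards, accuracy_rewards)
        if f == 1.0 and a == 0.0
    )
    return {
        "format_failure_rate": format_failures / n,
        "reasoning_failure_rate": reasoning_failures / n,
    }
```

A single combined scalar would collapse both failure modes into one number; keeping the streams separate is what makes this breakdown possible.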

WORKS WHEN

  • Output must follow specific structure (XML tags, JSON, LaTeX notation)
  • Downstream systems require parseable format to extract answers
  • Need to diagnose whether model failures are format or reasoning problems
  • Format requirements are checkable with regex or schema validation

FAILS WHEN

  • Output format is freeform text with no structural requirements
  • Format violations are rare enough to not affect training signal
  • Adding a second reward function doubles reward computation cost unacceptably
  • Format and accuracy are inherently coupled (format IS the answer, like classification labels)

Stage

build

From

August 2025
