How Hugging Face Turned Failed Training Runs into Accuracy Gains
TRIGGER
RL training for code generation tasks wastes failed rollouts—when the model produces incorrect code, it receives zero reward and learns nothing from the specific failure, even though the verifier's error message contains actionable debugging information.
APPROACH
Kimina-Prover stores failed rollouts (prompt, response, and Lean feedback) and builds new training samples in which the model is prompted to revise its previous reasoning and code based on the error. Only one error-fix turn is allowed, and error messages are capped at a fixed token limit. At each training step, half the samples are error-correction samples. Results: Pass@32 on MiniF2F improved from 72.95% to 76.23% for the 1.7B model, and applying error correction at inference time added another 1.64% (to 77.87%).
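A minimal sketch of the sample-construction step: take a failed rollout, cap the verifier feedback, and wrap it into a single revision turn. The names (`make_correction_sample`, `truncate_tokens`, `MAX_ERROR_TOKENS`) and the prompt wording are illustrative assumptions, not taken from the Kimina-Prover codebase; a real pipeline would also truncate with the model's own tokenizer rather than by whitespace.

```python
# Sketch: turn a failed rollout into an error-correction training sample.
# All names and the prompt template are hypothetical.

MAX_ERROR_TOKENS = 256  # assumed cap on verifier feedback length


def truncate_tokens(text: str, limit: int) -> str:
    """Crudely cap text by whitespace tokens (a real pipeline would use
    the model tokenizer)."""
    tokens = text.split()
    return " ".join(tokens[:limit])


def make_correction_sample(prompt: str, failed_response: str,
                           lean_feedback: str) -> dict:
    """Build a single error-fix turn: original prompt, the model's failed
    attempt, and the capped Lean error, asking for a revision."""
    error = truncate_tokens(lean_feedback, MAX_ERROR_TOKENS)
    revision_prompt = (
        f"{prompt}\n\n"
        f"Your previous attempt:\n{failed_response}\n\n"
        f"Lean reported the following error:\n{error}\n\n"
        "Revise your reasoning and code to fix this error."
    )
    return {"prompt": revision_prompt, "is_correction": True}
```

Because only one error-fix turn is allowed, there is no loop here: a correction sample that itself fails is simply discarded rather than extended into a longer repair dialogue.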
PATTERN
“Verifier error messages paired with failed generations become supervised examples for self-correction. Half your training samples can be error-fix turns when the failure rate is high enough.”
✓ WORKS WHEN
- Verifier produces structured, actionable error messages (type errors, tactic failures, assertion violations)
- Error messages are concise enough to fit in context without dominating the prompt
- Failure modes are recoverable—the error points to a fixable issue rather than fundamental approach problems
- Training infrastructure supports storing and replaying failed rollouts as new samples
- Task has high initial failure rate (>50%) providing abundant error correction examples
✗ FAILS WHEN
- Verifier only provides pass/fail without diagnostic information (black-box evaluation)
- Error messages are too verbose or noisy to extract signal (>500 tokens of stack traces)
- Most failures stem from wrong approach rather than fixable bugs—model needs to restart, not patch
- Training compute is constrained and error correction samples double the effective batch size
- Task has low failure rate (<10%) leaving insufficient error examples to learn from