What Hugging Face Learned About Optimal Training Difficulty
TRIGGER
RL training on uniformly sampled problems wastes compute on easy examples the model already solves consistently (providing no gradient signal) while undersampling hard problems that would drive improvement.
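The zero-signal effect can be sketched numerically. Assuming a group-relative advantage scheme (GRPO-style; the specific RL objective is an assumption, not stated above), each rollout's advantage is its reward minus the group's mean reward, so a problem the model always solves, or never solves, contributes exactly zero:

```python
# Sketch: group-relative advantages (reward minus group mean).
# When all rewards in a rollout group are equal, every advantage is zero
# and the problem contributes no gradient signal.

def group_advantages(rewards):
    """Advantage of each rollout relative to the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

easy  = [1, 1, 1, 1]  # solved every time -> all advantages 0
hard  = [0, 0, 0, 0]  # never solved      -> all advantages 0
mixed = [1, 0, 1, 0]  # partial success   -> nonzero signal

print(group_advantages(easy))   # [0.0, 0.0, 0.0, 0.0]
print(group_advantages(mixed))  # [0.5, -0.5, 0.5, -0.5]
```

Both the always-solved and never-solved cases collapse to zero advantage, which is why the curation below targets the partially-solved middle band.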
APPROACH
Kimina-Prover preprocesses the training dataset by filtering out problems with historical win rate above 0.5, generating variants of existing problems using Gemini for diversity, and duplicating hard problems to increase their sampling weight. Input: NuminaMath-LEAN dataset with solve rates. Output: Kimina-Prover-Promptset with challenging, high-value problems. The 1.7B model trained on this curated set achieved 76.23% Pass@32 on MiniF2F, improving 3+ points over the distilled baseline.
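The filter-and-duplicate steps can be sketched as below. This is a minimal illustration, not Kimina-Prover's actual pipeline: the field name `solve_rate`, the hardness cutoff `hard_threshold`, and the duplication factor are all hypothetical choices (the source only specifies the 0.5 win-rate filter), and variant generation via Gemini is omitted.

```python
# Sketch of the described curation: drop problems won more than half the
# time, duplicate the hardest ones to raise their sampling weight.
# `hard_threshold` and `dup_factor` are illustrative assumptions.

def curate(problems, win_rate_cutoff=0.5, hard_threshold=0.1, dup_factor=2):
    curated = []
    for p in problems:
        rate = p["solve_rate"]
        if rate > win_rate_cutoff:
            continue                  # filter: already solved too reliably
        copies = dup_factor if rate < hard_threshold else 1
        curated.extend([p] * copies)  # duplicate hard problems
    return curated

pool = [
    {"id": "a", "solve_rate": 0.90},  # easy: filtered out
    {"id": "b", "solve_rate": 0.30},  # medium: kept once
    {"id": "c", "solve_rate": 0.05},  # hard: kept twice
]
print([p["id"] for p in curate(pool)])  # ['b', 'c', 'c']
```

Duplication rather than per-example loss weighting keeps the change purely at the dataset level, so any off-the-shelf trainer samples hard problems more often without modification.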
PATTERN
“Easy problems teach nothing—they're already solved. Filter out tasks you win more than half the time, duplicate hard ones, and generate variants of valuable problems. Training compute should go where learning actually happens.”
✓ WORKS WHEN
- You have historical solve rate data from previous model versions or rollouts
- Problem difficulty varies significantly across the dataset (wide solve rate distribution)
- Hard problems are genuinely learnable—difficult but not impossible for the model
- Variant generation preserves problem difficulty while adding surface diversity
- Training is long enough for curriculum effects to manifest (dozens of passes through hard problems)
✗ FAILS WHEN
- No historical data exists and cold-start prevents win rate estimation
- Dataset difficulty is uniform—filtering removes too much or too little
- Hard problems require capabilities the model fundamentally lacks (filtering just removes signal)
- Variant generation drifts difficulty or introduces invalid problems
- Training budget is too small to benefit from focused curriculum (<1 epoch)