What Hugging Face Learned About Optimal Training Difficulty
TRIGGER
RL training on uniformly sampled problems wastes compute on easy examples the model already solves consistently (providing no gradient signal) while undersampling hard problems that would drive improvement.
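The zero-signal effect can be sketched numerically. Assuming a group-relative advantage scheme (GRPO-style; the specific RL objective is an assumption, not stated above), each rollout's advantage is its reward minus the group's mean reward, so a problem the model always solves, or never solves, contributes exactly zero:

```python
# Sketch: group-relative advantages (reward minus group mean).
# When all rewards in a rollout group are equal, every advantage is zero
# and the problem contributes no gradient signal.

def group_advantages(rewards):
    """Advantage of each rollout relative to the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

easy  = [1, 1, 1, 1]  # solved every time -> all advantages 0
hard  = [0, 0, 0, 0]  # never solved      -> all advantages 0
mixed = [1, 0, 1, 0]  # partial success   -> nonzero signal

print(group_advantages(easy))   # [0.0, 0.0, 0.0, 0.0]
print(group_advantages(mixed))  # [0.5, -0.5, 0.5, -0.5]
```

Both the always-solved and never-solved cases collapse to zero advantage, which is why the curation below targets the partially-solved middle band.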
APPROACH
Kimina-Prover preprocesses the training dataset by filtering out problems with historical win rate above 0.5, generating variants of existing problems using Gemini for diversity, and duplicating hard problems to increase their sampling weight. Input: NuminaMath-LEAN dataset with solve rates. Output: Kimina-Prover-Promptset with challenging, high-value problems. The 1.7B model trained on this curated set achieved 76.23% Pass@32 on MiniF2F, improving 3+ points over the distilled baseline.
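The filter-and-duplicate steps can be sketched as below. This is a minimal illustration, not Kimina-Prover's actual pipeline: the field name `solve_rate`, the hardness cutoff `hard_threshold`, and the duplication factor are all hypothetical choices (the source only specifies the 0.5 win-rate filter), and variant generation via Gemini is omitted.

```python
# Sketch of the described curation: drop problems won more than half the
# time, duplicate the hardest ones to raise their sampling weight.
# `hard_threshold` and `dup_factor` are illustrative assumptions.

def curate(problems, win_rate_cutoff=0.5, hard_threshold=0.1, dup_factor=2):
    curated = []
    for p in problems:
        rate = p["solve_rate"]
        if rate > win_rate_cutoff:
            continue                  # filter: already solved too reliably
        copies = dup_factor if rate < hard_threshold else 1
        curated.extend([p] * copies)  # duplicate hard problems
    return curated

pool = [
    {"id": "a", "solve_rate": 0.90},  # easy: filtered out
    {"id": "b", "solve_rate": 0.30},  # medium: kept once
    {"id": "c", "solve_rate": 0.05},  # hard: kept twice
]
print([p["id"] for p in curate(pool)])  # ['b', 'c', 'c']
```

Duplication rather than per-example loss weighting keeps the change purely at the dataset level, so any off-the-shelf trainer samples hard problems more often without modification.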
PATTERN
“Easy problems teach nothing—they're already solved. Filter out tasks you win more than half the time, duplicate hard ones, and generate variants of valuable problems. Training compute should go where learning actually happens.”
✓ WORKS WHEN
- You have historical solve rate data from previous model versions or rollouts
- Problem difficulty varies significantly across the dataset (wide solve rate distribution)
- Hard problems are genuinely learnable—difficult but not impossible for the model
- Variant generation preserves problem difficulty while adding surface diversity
- Training is long enough for curriculum effects to manifest (dozens of passes through hard problems)
✗ FAILS WHEN
- No historical data exists and cold-start prevents win rate estimation
- Dataset difficulty is uniform—filtering removes too much or too little
- Hard problems require capabilities the model fundamentally lacks (filtering just removes signal)
- Variant generation drifts difficulty or introduces invalid problems
- Training budget is too small to benefit from focused curriculum (<1 epoch)