The Contamination-Proof Benchmark
TRIGGER
AI benchmarks using fixed test sets suffer from data contamination—models may have seen answers during training, making it impossible to distinguish genuine reasoning from memorization without access to full training pipelines.
APPROACH
Hugging Face's FutureBench team built a benchmark that sources questions from two streams: (1) an AI agent that scrapes major news websites and generates time-bound prediction questions using DeepSeek-V3 + Firecrawl + Tavily, producing ~5 questions per session with 7-day resolution horizons; (2) Polymarket prediction-market questions (~8/week), filtered to exclude high-volume categories like temperature, stock, and crypto. Input: current news articles or market listings. Output: structured prediction questions with verifiable future outcomes. Questions are scored only after their real-world outcomes resolve.
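The two streams above converge on the same kind of record: a question with a category, a creation date, and a future resolution date. A minimal sketch of that record and the category filter, with hypothetical field and function names (the source describes the filtering, not this schema):

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical schema for a FutureBench-style question record;
# field names are illustrative, not taken from the source.
@dataclass
class PredictionQuestion:
    text: str
    source: str            # "news_agent" or "polymarket"
    category: str
    created: date
    resolution_date: date  # when ground truth becomes available

# Categories the source says are filtered out of the Polymarket stream.
EXCLUDED_CATEGORIES = {"temperature", "stock", "crypto"}

def accept(q: PredictionQuestion) -> bool:
    """Keep a question only if its category is allowed and it resolves
    strictly in the future (contamination-proof by construction)."""
    return q.category not in EXCLUDED_CATEGORIES and q.resolution_date > q.created

today = date(2025, 6, 1)
q1 = PredictionQuestion("Will candidate X win the June 8 runoff?",
                        "polymarket", "politics", today, today + timedelta(days=7))
q2 = PredictionQuestion("Will BTC close above $80k this week?",
                        "polymarket", "crypto", today, today + timedelta(days=7))
print([accept(q) for q in (q1, q2)])  # → [True, False]
```

The `resolution_date > created` check is what makes the benchmark contamination-proof: no training corpus assembled before `created` can contain the answer.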
PATTERN
“Benchmark scores you can't trust because models may have memorized answers from training data. Future-dated prediction questions eliminate contamination by construction. You can't train on data that doesn't exist yet.”
✓ WORKS WHEN
- Questions have clear resolution criteria and bounded time horizons (e.g., 7 days to 1 year)
- Domain has authoritative sources that will publish ground truth (government statistics, election results, official announcements)
- You need to compare models where training data overlap is unknown or suspected
- Evaluation cadence can match question resolution timeline (weekly question generation for weekly-resolving questions)
✗ FAILS WHEN
- Questions involve irreducible uncertainty with no skilled human baseline (weather on a specific date 2 years out)
- Resolution criteria are ambiguous or disputed (subjective outcomes, contested definitions)
- You need immediate evaluation results—waiting for real-world resolution adds latency
- Domain lacks prediction markets or news coverage to source meaningful questions