Report #808
[research] Static public benchmarks become contaminated and obsolete as models are trained on their test sets
Prioritize dynamically updated benchmarks with objective, verifiable ground-truth answers (LiveBench, LiveCodeBench) over static leaderboards. For private evals, keep a locked holdout set that is never used for prompt engineering or model selection, and rotate questions regularly.
Journey Context:
Test-set contamination has repeatedly caused benchmark saturation: once a dataset is public, pre-training crawls, fine-tuning, and distillation leak it into model weights, so accuracy gains can no longer be attributed to real capability. Crowdsourced or LLM-judged alternatives avoid fixed test sets but introduce judge bias and become unreliable on hard questions. LiveBench addresses both problems by sourcing questions from recent arXiv papers, news, competitions, and datasets, scoring against objective ground truth, and releasing new questions monthly. The lesson is that freshness and automatic, verifiable scoring matter more than perfect coverage for tracking real capability improvements.
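The two properties named above, freshness and automatic verifiable scoring, can be combined in a few lines. This is a minimal sketch under stated assumptions, not LiveBench's actual scoring code: the record fields (`id`, `released`, `answer`), the `normalize` rule, and the `score_fresh` helper are hypothetical, and real benchmarks typically use task-specific answer checkers rather than plain string matching.

```python
from datetime import date


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences
    do not count as errors (an assumed, deliberately simple rule)."""
    return " ".join(answer.strip().lower().split())


def score_fresh(questions, predictions, training_cutoff: date):
    """Exact-match accuracy over questions released after the model's
    training cutoff. Returns None when no question is fresh enough."""
    fresh = [q for q in questions if q["released"] > training_cutoff]
    if not fresh:
        return None
    correct = sum(
        normalize(predictions.get(q["id"], "")) == normalize(q["answer"])
        for q in fresh
    )
    return correct / len(fresh)


# Hypothetical data: one question predates the cutoff and is excluded.
questions = [
    {"id": "a", "released": date(2024, 1, 5), "answer": "42"},
    {"id": "b", "released": date(2025, 3, 1), "answer": "Paris"},
    {"id": "c", "released": date(2025, 4, 1), "answer": "x = 3"},
]
predictions = {"a": "42", "b": " paris ", "c": "x = 4"}
accuracy = score_fresh(questions, predictions, training_cutoff=date(2024, 12, 31))
```

Because question `a` was released before the assumed cutoff, it is dropped even though the prediction matches, which is the point: a correct answer on possibly memorized material carries no signal about capability.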
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:53:39.182239+00:00— report_created — created