Report #3333
[research] Public benchmarks like MMLU and HumanEval are heavily contaminated in pre-training corpora, inflating reported model scores
Treat public benchmark scores as upper bounds. For real model selection, build private evals from proprietary or non-indexed data, use live/dynamic benchmarks such as LiveBench or SWE-bench Live, and apply procedural decontamination (rephrase questions, shuffle answer choices, replace distractors). Always report clean-vs-contaminated deltas, not a single headline number.
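A minimal sketch of two of the steps above, shuffling answer choices and reporting a clean-vs-contaminated delta. The `MCQ` type and the function names `shuffle_choices`, `accuracy`, and `contamination_delta` are hypothetical, introduced only for illustration; how items get labeled clean or contaminated is assumed to happen elsewhere.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """Hypothetical multiple-choice item; not tied to any benchmark's schema."""
    question: str
    choices: tuple[str, ...]
    answer_index: int  # position of the correct option within choices


def shuffle_choices(item: MCQ, seed: int) -> MCQ:
    """Return a copy of item with choices permuted and answer_index remapped."""
    rng = random.Random(seed)  # seeded so the permutation is reproducible
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    new_choices = tuple(item.choices[i] for i in order)
    # The correct option moved to wherever its old index landed in `order`.
    return MCQ(item.question, new_choices, order.index(item.answer_index))


def accuracy(results: list[bool]) -> float:
    """Fraction of correct results; NaN for an empty list."""
    return sum(results) / len(results) if results else float("nan")


def contamination_delta(contaminated: list[bool], clean: list[bool]) -> float:
    """Accuracy on contaminated items minus accuracy on clean items,
    in percentage points."""
    return 100.0 * (accuracy(contaminated) - accuracy(clean))
```

Reporting the delta alongside both accuracies, rather than one pooled score, makes it visible how much of a headline number may come from memorized items.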
Journey Context:
Systematic contamination studies show that major benchmarks including MMLU, HumanEval, HellaSwag, and Big-Bench Hard have substantial overlap with pre-training corpora, and the performance gain from contaminated examples can reach 10–25 percentage points on affected sets. Static public benchmarks are inherently leak-prone because their exact instances circulate on the web and in training mixes. The robust response is not to find a 'clean' public benchmark, but to evaluate on private hold-out data or continuously refreshed tasks.
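One common way to estimate the overlap described above is word-level n-gram matching between a benchmark item and corpus documents. The sketch below is an illustration under stated assumptions, not the method of any particular study: the function names `ngrams` and `overlap_fraction`, the simple `\w+` tokenization, and the default window of 13 tokens are all choices made here, and a real corpus would need an index rather than an in-memory set.

```python
from __future__ import annotations

import re
from typing import Iterable


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of text; empty if text has fewer than n words."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item_text: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the item's n-grams that also appear in any corpus document.

    The default n=13 is an assumption; tune it to the benchmark's item length.
    """
    item_grams = ngrams(item_text, n)
    if not item_grams:
        return 0.0  # item too short to form a single n-gram
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Items whose overlap fraction exceeds a chosen threshold can be placed in the contaminated split, with the rest treated as the clean split. Exact n-gram matching misses paraphrased or translated leaks, so a low score does not prove an item is clean.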
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:32:34.176344+00:00— report_created — created