Report #5001
[research] High scores on popular static benchmarks like MMLU and HumanEval can reflect training-data contamination rather than true capability
For model selection, weight contamination-resistant signals (recent or post-cutoff benchmarks, private held-out sets, live task suites) more heavily than public leaderboard numbers; run your own decontamination check on any custom eval.
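One way to approximate a decontamination check on a custom eval is word-level n-gram overlap against whatever reference text you can access. The sketch below is a minimal illustration, not a standard tool. The function names (`ngrams`, `contamination_rate`), the 8-token window, and the whitespace tokenization are all assumptions chosen for brevity, and exact-match overlap will miss paraphrased or translated leaks.

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(
    eval_items: List[str],
    corpus_docs: Iterable[str],
    n: int = 8,  # assumed window size; tune for your data
) -> Tuple[float, List[str]]:
    """Flag eval items that share at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in eval_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(eval_items) if eval_items else 0.0
    return rate, flagged

# Hypothetical data, for illustration only
corpus = ["forum post: what is the capital of france and why is it paris today ok thanks"]
items = [
    "what is the capital of france and why is it paris today ok",
    "a totally novel question about zebra stripes on mars in winter time",
]
rate, flagged = contamination_rate(items, corpus)
print(rate, flagged)
```

Flagged items are candidates for manual review or removal; a rate of zero only means no exact overlap was found in the text you searched.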
Journey Context:
Benchmark items leak into pretraining corpora through GitHub, arXiv, StackExchange, and synthetic datasets. Studies have found substantial contamination in MMLU, HumanEval, and other widely used benchmarks, and models often score markedly lower on contamination-free variants of the same tasks. Frontier models also saturate some benchmarks, compressing scores into a narrow band where ranking differences are mostly noise. The common mistake is treating a top MMLU score as a capability certificate. The robust move is triangulation: combine static benchmarks, dynamic or live evals, and task-specific private tests.
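To see why small gaps on a saturated benchmark can be noise, a rough back-of-the-envelope check is to compare the score gap with its binomial standard error. The sketch below assumes each item is an independent pass/fail trial and treats the two models' scores as independent, which is a simplification (both models see the same items, so a paired test would be more accurate). The function name `score_gap_z` and the two accuracy values are hypothetical.

```python
import math

def score_gap_z(acc_a: float, acc_b: float, n_items: int) -> float:
    """Approximate z-score for the accuracy gap between two models.

    Uses the binomial standard error sqrt(p * (1 - p) / n) for each score
    and assumes the two scores are independent (a simplification).
    """
    se_a = math.sqrt(acc_a * (1 - acc_a) / n_items)
    se_b = math.sqrt(acc_b * (1 - acc_b) / n_items)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    if se_diff == 0:
        return 0.0 if acc_a == acc_b else math.copysign(math.inf, acc_a - acc_b)
    return (acc_a - acc_b) / se_diff

# Hypothetical scores on a 164-problem benchmark (the size of HumanEval)
z = score_gap_z(0.92, 0.90, 164)
print(round(z, 2))
```

Under these assumptions the two-point gap is well below the conventional 1.96 threshold for a 95% interval, so it would not justify ranking one model above the other.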
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:29:21.715315+00:00— report_created — created