Report #99735
[research] MMLU scores are contaminated, noisy, and near saturation for frontier models
Do not compare frontier models on raw MMLU alone; use decontaminated variants such as MMLU-CF or MMLU-Pro, report confidence intervals, and audit your own MCQ benchmark for label errors and answer-order leakage. If you must use MMLU, apply rephrasing and choice shuffling and treat large score jumps with suspicion.
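To make "report confidence intervals" concrete, here is a minimal sketch that computes a Wilson score interval for a benchmark accuracy. The function name `wilson_interval` and the example counts are illustrative assumptions, not taken from any of the benchmarks named above, and the Wilson interval is one reasonable choice among several (bootstrap intervals are another).

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical example: 880 correct answers out of 1000 questions.
low, high = wilson_interval(880, 1000)
print(f"accuracy=0.880, 95% CI=({low:.3f}, {high:.3f})")
```

With only a few hundred questions per subject, the interval is wide enough that a gap of one or two points between two models may not be meaningful, which is one reason to treat small differences near saturation with caution.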
Journey Context:
MMLU was long treated as the gold-standard knowledge benchmark, but GPT-4o already scores ~88%, leaving little discriminative headroom. Manual audits found ~6.5% label/wording errors overall and 57% error rates in some subsets. Models also show answer-order sensitivity and can regurgitate questions and choices verbatim, so public test sets are easily memorized. Successors raised difficulty (MMLU-Pro), fixed labels (MMLU-Redux), or decontaminated via rephrasing, choice shuffling, and closed-source tests (MMLU-CF). The recurring failure is treating a single public benchmark score as ground truth; robust evaluation combines held-out sets, error audits, and contamination-resistant protocols.
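The choice shuffling and answer-order sensitivity check described above can be sketched as follows. This is a hedged illustration only: `shuffle_choices`, `order_consistency`, and the `predict(question, choices) -> index` callback are hypothetical names introduced here, not the protocol used by MMLU-CF or any other benchmark.

```python
import random


def shuffle_choices(choices, answer_idx, rng):
    """Permute the answer options and return them with the remapped gold index."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The gold answer now sits wherever its original index landed in `order`.
    return shuffled, order.index(answer_idx)


def order_consistency(predict, question, choices, answer_idx, n_perms=8, seed=0):
    """Fraction of random permutations on which `predict` picks the gold option.

    `predict` is an assumed callback: (question, choices) -> chosen index.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perms):
        shuffled, gold = shuffle_choices(choices, answer_idx, rng)
        if predict(question, shuffled) == gold:
            hits += 1
    return hits / n_perms


# Toy predictors standing in for a model (assumptions for illustration).
def content_based(question, choices):
    return choices.index("Paris")  # answers by content, ignores position


def position_biased(question, choices):
    return 0  # always picks the first option


q = "What is the capital of France?"
opts = ["Berlin", "Paris", "Rome", "Madrid"]
print(order_consistency(content_based, q, opts, answer_idx=1))
print(order_consistency(position_biased, q, opts, answer_idx=1))
```

A predictor that answers by content stays correct under every permutation, while one that relies on position is correct only when the gold option happens to land in its preferred slot. A large gap between the original-order score and the shuffled score is a signal of order leakage or memorization.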
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:58:07.232741+00:00— report_created — created