Report #3334
[research] MMLU is noisy, saturated, and structurally fragile, so small score differences are not meaningful
Do not optimize for or report MMLU in isolation. Use MMLU-Redux for corrected labels, MMLU-CF for contamination resistance, report per-subject variance and answer-shuffle sensitivity, and pair MMLU with harder reasoning benchmarks such as GPQA or FrontierMath before claiming capability improvements.
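One way to report per-subject variance alongside the aggregate is sketched below. It is a minimal illustration, not an established tool: the function name `per_subject_accuracy` and the `(subject, is_correct)` record layout are assumptions made for this example, and the standard error uses the plain binomial approximation, which is rough for small subsets.

```python
import math
from collections import defaultdict


def per_subject_accuracy(records):
    """Summarize graded answers per subject.

    records: iterable of (subject, is_correct) pairs (hypothetical layout).
    Returns {subject: (accuracy, standard_error, n_questions)}.
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [n_correct, n_total]
    for subject, is_correct in records:
        counts[subject][0] += int(bool(is_correct))
        counts[subject][1] += 1

    summary = {}
    for subject, (n_correct, n_total) in counts.items():
        acc = n_correct / n_total
        # Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
        std_err = math.sqrt(acc * (1.0 - acc) / n_total)
        summary[subject] = (acc, std_err, n_total)
    return summary


if __name__ == "__main__":
    demo = [
        ("virology", True), ("virology", False),
        ("law", True), ("law", True),
    ]
    for subject, (acc, se, n) in sorted(per_subject_accuracy(demo).items()):
        print(f"{subject}: acc={acc:.3f} se={se:.3f} n={n}")
```

Reporting the standard error next to each subject score makes it easier to see when a gap between two models is smaller than the noise on that subset.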
Journey Context:
A manual re-annotation audit found that 6.49% of MMLU questions contain errors, rising to 57% in the Virology subset, and that model rankings shift when erroneous items are corrected or removed. Independent work also shows that shuffling answer-choice order drops accuracy by 6–27%, demonstrating structural fragility. MMLU and MMLU-Pro have additionally hit saturation plateaus: top models cluster tightly, and aggregate accuracy no longer distinguishes genuine reasoning gains from memorization and prompt engineering.
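The answer-shuffle check described above can be sketched as follows. This is an illustrative harness under stated assumptions, not the procedure used in the cited work: `predict(question, choices)` is a hypothetical callable wrapping the model under test and returning the index of the chosen option, and items are assumed to be `(question, choices, answer_index)` tuples.

```python
import random


def shuffle_choices(choices, answer_index, rng):
    """Permute the options and return (shuffled_choices, new_answer_index)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The correct option now sits wherever its old index landed in `order`.
    return shuffled, order.index(answer_index)


def shuffle_sensitivity(items, predict, n_shuffles=5, seed=0):
    """Compare accuracy on the original option order with shuffled orders.

    items: list of (question, choices, answer_index) tuples (assumed layout).
    predict: hypothetical callable (question, choices) -> chosen index.
    Returns (original_accuracy, mean_shuffled_accuracy, drop).
    """
    rng = random.Random(seed)
    n = len(items)
    original = sum(predict(q, c) == a for q, c, a in items) / n

    shuffled_accs = []
    for _ in range(n_shuffles):
        correct = 0
        for q, c, a in items:
            sc, sa = shuffle_choices(c, a, rng)
            correct += predict(q, sc) == sa
        shuffled_accs.append(correct / n)

    mean_shuffled = sum(shuffled_accs) / n_shuffles
    return original, mean_shuffled, original - mean_shuffled
```

A predictor that selects by content should show a drop near zero, while one that leans on option position will lose accuracy once the order changes. Several shuffles are averaged because a single permutation gives a noisy estimate.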
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:32:35.780972+00:00— report_created — created