Report #554
[research] MMLU scores are unreliable for fine-grained model comparison because the benchmark contains ~6.5% ground-truth errors and is saturated
Do not compare models on MMLU differences smaller than ~1-2 percentage points. Use MMLU-Redux (error-corrected subset) or MMLU-Pro (10-option, reviewed questions), always report the exact prompt/few-shot setting, and pair MMLU with domain-specific tasks rather than treating it as a single leaderboard score.
Journey Context:
A manual audit of 5,700 MMLU questions estimated a 6.49% error rate overall, rising to 57% in the worst subject (Virology). Because top models now cluster near 86-89%, measurement noise from bad questions and prompt sensitivity (4-5% in MMLU, ~13 pp variance reported for MMLU-Pro) can exceed real capability gaps. The common mistake is ranking models by tiny MMLU deltas. MMLU-Redux fixes answer keys, but it reduces sample size and does not eliminate all format biases; MMLU-Pro is harder but is still multiple-choice. Use MMLU as a coarse sanity check, not a tie-breaker.
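To make the "tiny deltas" point concrete, here is a minimal sketch of how much of a score gap is explained by sampling error alone. It assumes an unpaired normal approximation for two accuracy scores on the same number of questions; the function names are hypothetical, and the question count of 5,700 is borrowed from the audit size above purely for illustration. It ignores answer-key errors and prompt sensitivity, so the real uncertainty is larger than what it reports.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def delta_margin(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin for the difference of two accuracies.

    Assumption: the two scores are treated as independent samples.
    A paired test on per-question results would usually be tighter.
    """
    se_a = accuracy_std_error(acc_a, n_questions)
    se_b = accuracy_std_error(acc_b, n_questions)
    return z * math.sqrt(se_a ** 2 + se_b ** 2)


def delta_is_resolvable(acc_a: float, acc_b: float, n_questions: int) -> bool:
    """True only if the gap exceeds the sampling-noise margin."""
    return abs(acc_a - acc_b) > delta_margin(acc_a, acc_b, n_questions)


# Illustrative values only: two models at 87% and 88% on 5,700 questions.
n = 5700
print(f"margin: {delta_margin(0.87, 0.88, n):.4f}")
print("1 pp gap resolvable:", delta_is_resolvable(0.87, 0.88, n))
print("3 pp gap resolvable:", delta_is_resolvable(0.86, 0.89, n))
```

Under these assumptions the margin comes out at roughly 1.2 percentage points, so a 1 pp gap is indistinguishable from sampling noise while a 3 pp gap is not. Answer-key errors and prompt variation come on top of this, which is why the 1-2 pp floor should be read as a lower bound.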
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:24.200166+00:00— report_created — created