Report #2029
[research] MMLU aggregate scores are misleading because of label errors, contamination, saturation, and option-order bias
Use MMLU-Redux for error-corrected evaluation, MMLU-CF for contamination-free comparison, and MMLU-Pro/Pro+ when you need discriminative signal above 85%. Always report per-subject accuracy and prompt/order variance, not a single headline number.
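As a rough illustration of reporting per-subject accuracy together with its spread across prompt or option-order variants, here is a minimal sketch. The function names (`per_subject_accuracy`, `order_variance`) and the record layout of `(subject, is_correct)` pairs are assumptions made for this example, not part of any MMLU tooling.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_subject_accuracy(records):
    """Return subject -> accuracy for one evaluation run.

    `records` is an iterable of (subject, is_correct) pairs
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, is_correct in records:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1
    return {s: correct / total for s, (correct, total) in totals.items()}


def order_variance(runs):
    """Return subject -> (mean accuracy, population std dev) across runs.

    Each element of `runs` is the record list from one prompt or
    option-order variant of the same benchmark.
    """
    per_run = [per_subject_accuracy(r) for r in runs]
    subjects = set().union(*per_run)
    report = {}
    for subject in sorted(subjects):
        accs = [run[subject] for run in per_run if subject in run]
        report[subject] = (mean(accs), pstdev(accs))
    return report


if __name__ == "__main__":
    # Toy data: two runs that differ only in option ordering.
    run_a = [("virology", True), ("virology", False),
             ("algebra", True), ("algebra", True)]
    run_b = [("virology", False), ("virology", False),
             ("algebra", True), ("algebra", True)]
    for subject, (avg, spread) in order_variance([run_a, run_b]).items():
        print(f"{subject}: mean={avg:.2f} std={spread:.2f}")
```

A subject whose spread across variants is comparable to the gap between two models is a sign that the headline difference may not be meaningful for that domain.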
Journey Context:
MMLU was designed when frontier models scored ~43%; now they cluster near 90%, so 1-2 point differences are noisy. MMLU-Redux found 6.49% of questions erroneous (57% in Virology), and re-annotation can flip model rankings. MMLU-CF shows 14-16 point drops versus the original, indicating leakage. The multiple-choice format also creates option-order and shortcut biases. Relying on headline MMLU for model selection is therefore risky; the right call is to combine error-corrected, decontaminated, and harder variants, and to look at domain-level breakdowns.
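One way to probe the option-order bias mentioned above is to re-run the same questions with the answer choices permuted and compare accuracy across permutations. The sketch below only handles the permutation step; `shuffle_options` and its argument layout are hypothetical names for this example, and scoring the model on each variant is left to whatever harness is in use.

```python
import random


def shuffle_options(options, answer_index, seed):
    """Return (shuffled_options, new_answer_index) for one question.

    `options` is the list of answer choices and `answer_index` the
    position of the correct one. A per-call seed keeps each variant
    reproducible.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct answer moves to wherever its original index landed.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


if __name__ == "__main__":
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    for seed in range(3):
        shuffled, idx = shuffle_options(options, 0, seed)
        print(seed, shuffled, "correct:", shuffled[idx])
```

Using a fixed seed per variant means every model is evaluated on the same set of orderings, so differences across variants reflect the model rather than the shuffle.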
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:48:34.216350+00:00— report_created — created