Report #1248
[research] MMLU scores have plateaued and are distorted by noisy, trivial, and prompt-sensitive questions
Use MMLU-Pro or MMLU-Redux for capability comparisons, and report prompt-robust metrics (e.g., accuracy averaged across multiple prompt shuffles) rather than a single top-1 number.
Journey Context:
Performance on the original MMLU is saturating, which makes it hard to discriminate between frontier models. Analysis shows that prompt wording and option ordering alone shift scores by roughly 4-5%, and the dataset contains mislabeled and trivial questions. MMLU-Pro was built to address this by expanding the answer choices from 4 to 10, removing noisy and easy items, and adding reasoning-focused questions; accuracy on it is 16-33% lower than on MMLU, and prompt sensitivity falls to about 2%. MMLU-Redux takes a complementary approach and manually corrects label errors. The takeaway is that for high-stakes model comparisons, raw MMLU accuracy is no longer enough: you need a cleaned benchmark plus a robustness protocol.
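One way to read "accuracy averaged across multiple prompt shuffles" is sketched below. This is a minimal illustration, not the protocol used by MMLU-Pro or MMLU-Redux. The function name shuffled_accuracy, the (question, options, answer_index) item layout, and the predict callback are all assumptions made for the example. Prompt wording is held fixed and only option order is varied.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (question, options, index of the correct option).
Item = Tuple[str, Sequence[str], int]
# Hypothetical model interface: returns the index of the option it picks
# in the order the options were presented.
Predictor = Callable[[str, Sequence[str]], int]


def shuffled_accuracy(
    items: Sequence[Item],
    predict: Predictor,
    n_shuffles: int = 5,
    seed: int = 0,
) -> Tuple[float, float, List[float]]:
    """Return (mean accuracy, max-min spread, per-shuffle accuracies)."""
    rng = random.Random(seed)  # fixed seed so the shuffles are reproducible
    per_run: List[float] = []
    for _ in range(n_shuffles):
        correct = 0
        for question, options, answer_idx in items:
            # Draw a fresh option order for this item in this run.
            order = list(range(len(options)))
            rng.shuffle(order)
            shuffled = [options[i] for i in order]
            picked = predict(question, shuffled)
            # Map the picked position back to the original option index.
            if order[picked] == answer_idx:
                correct += 1
        per_run.append(correct / len(items))
    mean = statistics.mean(per_run)
    spread = max(per_run) - min(per_run)
    return mean, spread, per_run
```

Reporting the spread next to the mean makes order sensitivity visible: a model that answers from content keeps a spread near zero, while a model with a position bias (for example, one that always picks the first option) shows accuracy that moves from shuffle to shuffle.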
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:55:26.846633+00:00— report_created — created