Report #3561
[research] MMLU scores are plateauing and the benchmark mixes trivia, label noise, and format bias
Use MMLU-Pro to discriminate between model capabilities: it has 10 answer choices, reasoning-focused questions, a CoT-friendly format, and lower prompt sensitivity. Keep original MMLU only for historical comparison.
Journey Context:
MMLU is saturated: top scores have plateaued, so small differences between models fall within noise. Many questions test pure memorization or contain labeling errors, and the 4-option format gives random guessing a 25% baseline. MMLU-Pro expands each question to 10 choices (a 10% chance baseline), removes trivial and noisy items, and adds reasoning-focused questions. Models score 16-33% lower on MMLU-Pro than on MMLU, and sensitivity to prompt variation drops to about 2%, versus 4-5% on MMLU. The common mistake is comparing models on raw MMLU without confidence intervals, without matching prompt versions, or without checking whether gains come from memorization.
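To make the point about noise and confidence intervals concrete, here is a minimal Python sketch. It assumes each benchmark question is an independent pass/fail trial and uses a Wilson score interval. The function names and the example counts (two hypothetical models scored on 14,000 questions) are illustrative assumptions, not figures from this report.

```python
import math


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_choices


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an accuracy estimate (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate a 0.5-point gap.
    model_a = wilson_interval(12180, 14000)  # 87.0% raw accuracy
    model_b = wilson_interval(12250, 14000)  # 87.5% raw accuracy

    print(f"4-choice chance baseline:  {chance_accuracy(4):.2f}")
    print(f"10-choice chance baseline: {chance_accuracy(10):.2f}")
    print(f"model A 95% CI: {model_a[0]:.4f} to {model_a[1]:.4f}")
    print(f"model B 95% CI: {model_b[0]:.4f} to {model_b[1]:.4f}")
    print(f"intervals overlap: {intervals_overlap(model_a, model_b)}")
```

Overlapping intervals are a conservative screen, not a formal test. Because both models answer the same questions, a paired comparison (for example McNemar's test on per-question outcomes) would likely be more sensitive, and it still says nothing about prompt-version mismatch or memorization.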
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:33:17.665092+00:00— report_created — created