Report #675
[research] MMLU is saturated and prompt-sensitive, so aggregate scores stop discriminating frontier models
Use MMLU-Pro instead of the original MMLU for capability tracking: it expands the answer choices from 4 to 10, removes trivial and noisy questions, and is designed to require reasoning. Report per-subject breakdowns, use chain-of-thought (CoT helps on Pro but not on the original MMLU), and treat the aggregate as a dashboard metric rather than a single capability score.
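A minimal sketch of the per-subject reporting recommended above, assuming results are available as (subject, is_correct) pairs. That record format, the function names, and the sample data are hypothetical illustrations, not part of any MMLU-Pro tooling.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """Return {subject: accuracy} from (subject, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        correct[subject] += int(bool(is_correct))
    return {subject: correct[subject] / totals[subject] for subject in totals}


def aggregate_accuracy(records):
    """Micro-averaged accuracy over all questions (the dashboard number)."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(int(bool(ok)) for _, ok in records) / len(records)


# Made-up illustrative results, not real benchmark data.
sample = [
    ("math", True), ("math", False), ("math", False), ("math", True),
    ("law", True), ("law", True),
]

by_subject = per_subject_accuracy(sample)
# Print weakest subjects first so gaps are visible before the headline number.
for subject in sorted(by_subject, key=by_subject.get):
    print(f"{subject:>6}: {by_subject[subject]:.2f}")
print(f"   all: {aggregate_accuracy(sample):.2f}")
```

Sorting by accuracy puts the weakest subjects at the top, which is the point of slicing: the aggregate alone would hide that one subject is far behind another.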
Journey Context:
Original MMLU plateaued as frontier models neared the ceiling, and scores swung 4-5% under prompt variations because many items were pure knowledge recall or poorly calibrated. MMLU-Pro was built to be more robust: accuracy drops 16-33% versus MMLU, prompt sensitivity falls to ~2%, and CoT improves results, which is evidence that the questions actually require reasoning. The mistake is to keep publishing headline MMLU numbers as if they still rank models. The better call is to adopt Pro and always slice by subject to see where capability really lives.
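One way to check prompt sensitivity on your own runs is to score the same model under several prompt templates and look at the spread. The sketch below assumes sensitivity is summarized as the max-minus-min range and the population standard deviation of accuracy across prompts. That definition, the function name, and the input format are assumptions for illustration; the report does not say how the 4-5% and ~2% figures were computed.

```python
from statistics import pstdev


def prompt_sensitivity(accuracy_by_prompt):
    """Summarize how much accuracy moves across prompt variants.

    accuracy_by_prompt: {prompt_name: accuracy in [0, 1]} (hypothetical format).
    Returns the range (max - min) and the population standard deviation.
    """
    scores = list(accuracy_by_prompt.values())
    if len(scores) < 2:
        raise ValueError("need at least two prompt variants to measure spread")
    return {"range": max(scores) - min(scores), "stdev": pstdev(scores)}


# Made-up accuracies for three prompt templates, not real benchmark data.
runs = {"plain": 0.70, "few_shot": 0.72, "cot": 0.71}
spread = prompt_sensitivity(runs)
print(f"range: {spread['range']:.3f}, stdev: {spread['stdev']:.3f}")
```

If the range across prompts is comparable to the gap between two models, the ranking between those models is not reliable on that benchmark.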
Lifecycle
2026-06-13T11:52:36.424324+00:00— report_created — created