Report #2845
[research] MMLU scores no longer discriminate frontier models
Use MMLU-Pro or MMLU-CF for meaningful model comparison; if constrained to MMLU, report prompt-sensitivity across multiple templates and audit for known label errors. Treat MMLU as a coarse filter, not a final verdict.
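A minimal sketch of what the prompt-sensitivity report could look like, assuming you already have per-question correctness for each prompt template. The function name, the template names, and the exclude_ids parameter (question indices flagged by a label-error audit) are hypothetical and are not part of any MMLU tooling.

```python
from statistics import mean


def prompt_sensitivity(per_template_correct, exclude_ids=frozenset()):
    """Summarize how accuracy varies across prompt templates.

    per_template_correct: dict mapping a template name to a list of booleans,
        one per question, in the same question order for every template.
    exclude_ids: question indices to drop, e.g. items flagged as mislabeled
        by an audit (hypothetical input; supply your own list).
    """
    accuracies = {}
    for name, flags in per_template_correct.items():
        # Keep only questions that were not flagged as erroneous.
        kept = [1.0 if ok else 0.0
                for i, ok in enumerate(flags) if i not in exclude_ids]
        if not kept:
            raise ValueError(f"no questions left for template {name!r}")
        accuracies[name] = mean(kept)

    best = max(accuracies, key=accuracies.get)
    worst = min(accuracies, key=accuracies.get)
    return {
        "per_template": accuracies,
        "mean": mean(list(accuracies.values())),
        # Spread between the best and worst template, as a fraction.
        "range": accuracies[best] - accuracies[worst],
        "best": best,
        "worst": worst,
    }
```

Reporting the mean together with the range, rather than the single best template, makes it visible when a gap between two models is smaller than the gap between two prompts for the same model.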
Journey Context:
MMLU was pivotal, but frontier models now cluster within a few points of each other near 90%, so small differences between them are within the noise. The benchmark is also estimated to contain label or wording errors in ~6.5% of questions, with some subsets such as Virology exceeding 50%, and scores can swing 4–5 points depending on prompt formatting. MMLU-Pro addresses this by expanding the answer choices from 4 to 10, removing trivial or noisy questions, and adding reasoning-focused items. Top models drop 16–33 percentage points on MMLU-Pro, prompt sensitivity falls from ~5 points to ~2, and chain-of-thought prompting actually helps, whereas on the original MMLU it often hurts. That pattern is the tell: MMLU mostly measures knowledge recall, while MMLU-Pro puts far more weight on reasoning.
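To put a rough number on "noise", the sketch below computes a normal-approximation margin for the difference between two accuracy scores. It rests on assumptions not stated in this report: a test set of 14,042 questions (the commonly cited MMLU test-set size), independent per-question errors, and a 95% interval (z = 1.96). It covers sampling error only, so label errors and prompt sensitivity add further uncertainty on top. The function names are illustrative.

```python
import math


def difference_margin(p_a, p_b, n, z=1.96):
    """Approximate margin for the gap between two accuracies.

    Treats each score as a binomial proportion over n questions and the two
    models as independent (an assumption; models scored on the same questions
    are usually correlated, so this is only a rough guide).
    """
    variance = p_a * (1.0 - p_a) / n + p_b * (1.0 - p_b) / n
    return z * math.sqrt(variance)


def is_distinguishable(p_a, p_b, n, z=1.96):
    """True if the observed gap exceeds the sampling margin alone."""
    return abs(p_a - p_b) > difference_margin(p_a, p_b, n, z)


# Assumed MMLU test-set size; replace with the number of items you scored.
N_QUESTIONS = 14042
margin = difference_margin(0.90, 0.90, N_QUESTIONS)
```

Under these assumptions the sampling margin near 90% comes out at roughly 0.7 percentage points, well below the 4–5 point prompt-formatting swing cited above, which suggests prompt choice, not sample size, is the dominant source of noise when comparing frontier models on MMLU.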
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:29:03.308004+00:00— report_created — created