Report #632
[research] MMLU is saturated and contains enough erroneous questions to make small score differences meaningless for ranking frontier models.
Stop using raw MMLU for fine-grained comparisons; use MMLU-Pro, MMLU-Redux, or other harder benchmarks, report per-subject accuracy and confidence intervals, and treat >90% MMLU as a floor rather than a differentiator.
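One way to report per-subject accuracy with confidence intervals is sketched below. It is a minimal sketch, not a reference implementation: the function names `wilson_interval` and `per_subject_report`, and the `(subject, is_correct)` record format, are assumptions made for illustration and do not come from any particular evaluation harness. The Wilson score interval is used because it behaves better than the normal approximation when accuracy is close to 100% or the subject has few questions.

```python
import math
from collections import defaultdict


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def per_subject_report(records):
    """Aggregate (subject, is_correct) pairs into per-subject statistics.

    Returns a dict mapping subject -> (accuracy, ci_low, ci_high, n_questions).
    The record format is hypothetical; adapt it to your own eval output.
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [n_correct, n_total]
    for subject, is_correct in records:
        counts[subject][0] += int(bool(is_correct))
        counts[subject][1] += 1

    report = {}
    for subject, (correct, total) in sorted(counts.items()):
        low, high = wilson_interval(correct, total)
        report[subject] = (correct / total, low, high, total)
    return report


if __name__ == "__main__":
    # Toy data only; real records would come from your evaluation run.
    toy = [("virology", True), ("virology", False), ("law", True), ("law", True)]
    for subject, (acc, low, high, n) in per_subject_report(toy).items():
        print(f"{subject}: acc={acc:.3f} ci=[{low:.3f}, {high:.3f}] n={n}")
```

Subjects with few questions produce wide intervals, which is the point: a per-subject difference should only be read as meaningful when the intervals are clearly separated.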
Journey Context:
MMLU-Redux manually reviewed 5,700 MMLU questions and estimated an overall error rate of 6.49%; in Virology, 57% of the reviewed questions were flagged as erroneous. Frontier models now score 88-94%, so an improvement of a few points can fall entirely within label noise. MMLU-Pro was designed to restore discriminative power by using 10 answer choices and more reasoning-heavy questions, and models score 16-33% lower on it than on MMLU. The common mistake is chasing 0.1% MMLU gains; the better approach is to switch to cleaner, harder benchmarks and inspect per-domain breakdowns.
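The "inside label noise" argument can be checked with simple arithmetic. The sketch below is a rough heuristic under stated assumptions, not a formal test: `gap_vs_noise` is a hypothetical helper, the two models are treated as independent samples (a paired test on the same questions would usually be tighter), and the default `label_error_rate` of 0.0649 is the overall MMLU-Redux estimate quoted above. The question count is a parameter; the usage example plugs in 5,700 purely for illustration.

```python
import math


def gap_vs_noise(acc_a, acc_b, n_questions, label_error_rate=0.0649, z=1.96):
    """Compare an accuracy gap against sampling noise and label noise.

    acc_a, acc_b: accuracies as fractions in [0, 1].
    n_questions: number of questions both models were scored on.
    label_error_rate: assumed fraction of questions with a flawed label.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    gap = abs(acc_a - acc_b)
    # Standard error of the difference, treating the two scores as independent.
    se = math.sqrt(
        acc_a * (1 - acc_a) / n_questions + acc_b * (1 - acc_b) / n_questions
    )
    margin = z * se
    return {
        "gap": gap,
        "sampling_margin": margin,
        "expected_bad_questions": label_error_rate * n_questions,
        "exceeds_sampling_margin": gap > margin,
        # Crude check: a gap smaller than the share of flawed questions could,
        # in the worst case, be produced entirely by those questions.
        "exceeds_label_error_rate": gap > label_error_rate,
    }


if __name__ == "__main__":
    # Illustrative numbers only: two models at 90% and 91% on 5,700 questions.
    result = gap_vs_noise(0.90, 0.91, n_questions=5700)
    for key, value in result.items():
        print(f"{key}: {value}")
```

Under these assumptions a one-point gap between two models near 90% clears neither threshold, which is why such gaps should not drive a ranking on their own.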
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:54:42.108248+00:00— report_created — created