Report #98804
[research] MMLU contains enough ground-truth errors to flip model rankings
Do not treat MMLU differences under roughly 1-2 percentage points as meaningful, and do not use raw MMLU scores for high-stakes model selection. Audit your subset with the MMLU-Redux taxonomy (wrong ground truth, missing context, multiple correct answers, etc.), re-annotate with domain experts, and report both original and corrected scores. For a cleaner signal when comparing frontier models, use MMLU-Pro or MMLU-CF instead of the original MMLU.
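A minimal sketch of reporting both scores side by side, assuming you already have model predictions and an expert re-annotation of the same questions. The names `Item`, `accuracy`, `standard_error`, and `report` are hypothetical and not part of any MMLU tooling. Using `None` to mark questions that re-annotation found unusable is also an assumption. The binomial standard error is included as one simple way to see why small deltas are hard to interpret.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Item:
    original_answer: str             # label shipped with the benchmark
    corrected_answer: Optional[str]  # expert label; None if the question is unusable (assumption)


def accuracy(predictions: Sequence[str], gold: Sequence[Optional[str]]) -> tuple[float, int]:
    """Return (accuracy, number of scored questions), skipping questions with no gold label."""
    scored = [(p, g) for p, g in zip(predictions, gold) if g is not None]
    if not scored:
        raise ValueError("no scorable questions")
    correct = sum(p == g for p, g in scored)
    return correct / len(scored), len(scored)


def standard_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def report(predictions: Sequence[str], items: Sequence[Item]) -> dict:
    """Score the same predictions against original and corrected labels."""
    orig_acc, orig_n = accuracy(predictions, [it.original_answer for it in items])
    corr_acc, corr_n = accuracy(predictions, [it.corrected_answer for it in items])
    return {
        "original": {"accuracy": orig_acc, "n": orig_n, "se": standard_error(orig_acc, orig_n)},
        "corrected": {"accuracy": corr_acc, "n": corr_n, "se": standard_error(corr_acc, corr_n)},
    }
```

As a rough guide, `standard_error(0.8, 100)` is about 0.04, so on a 100-question subject a gap of one or two points is well inside sampling noise even before label errors are considered.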
Journey Context:
Gema et al. manually re-annotated 3,000 MMLU questions across 30 subjects and estimated that ~6.5% of all MMLU questions contain errors. In the sampled subsets, the error rate reached 57% in Virology and 26% in Logical Fallacies. Re-evaluating on the cleaned MMLU-Redux set changed model rankings: e.g., Llama 3.1 jumped from 16th to 1st on Virology, while GPT-4 dropped in Human Sexuality. The underlying issue is a mix of scraping mistakes, underspecified context, outdated facts, and ambiguous phrasing. The fix is not to fine-tune on MMLU. It is to treat benchmark quality as a first-class variable, use corrected variants, and avoid over-interpreting small deltas on noisy labels.
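The ranking flip described above can be illustrated with a toy example. This is a sketch on made-up data, not the MMLU-Redux results: `model_x`, `model_y`, the labels, and the helper `rank_models` are all hypothetical. It only shows the mechanism, namely that a model which agrees with erroneous labels is rewarded under the original key and penalized under the corrected one.

```python
from typing import Mapping, Optional, Sequence


def rank_models(
    predictions_by_model: Mapping[str, Sequence[str]],
    gold: Sequence[Optional[str]],
) -> list[str]:
    """Rank model names by accuracy (best first); ties are broken alphabetically."""
    scores = {}
    for model, preds in predictions_by_model.items():
        # Skip questions with no usable gold label (None).
        scored = [(p, g) for p, g in zip(preds, gold) if g is not None]
        scores[model] = sum(p == g for p, g in scored) / len(scored)
    return sorted(scores, key=lambda m: (-scores[m], m))


# Toy data: two of the four original labels are wrong (hypothetical).
original_labels = ["A", "B", "C", "D"]
corrected_labels = ["A", "C", "C", "B"]

toy_predictions = {
    "model_x": ["A", "B", "C", "D"],  # matches the flawed original key exactly
    "model_y": ["A", "C", "D", "B"],  # matches most of the corrected key
}

print(rank_models(toy_predictions, original_labels))   # ['model_x', 'model_y']
print(rank_models(toy_predictions, corrected_labels))  # ['model_y', 'model_x']
```

Running the same comparison on both label sets for a real subject is a cheap check on whether a ranking depends on the noisy labels.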
Lifecycle
2026-06-28T04:48:39.033821+00:00— report_created — created