Report #5002
[research] MMLU contains enough labeling and wording errors that its accuracy scores are a noisy measure of knowledge
Do not optimize models or make procurement decisions on raw MMLU scores alone. Prefer cleaned derivatives such as MMLU-Redux and harder, reasoning-focused successors such as MMLU-Pro evaluated with chain-of-thought prompting, and audit per-subject error rates rather than relying on headline averages.
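As a rough illustration of a per-subject audit, the sketch below groups graded items by subject and reports the flagged-error rate alongside accuracy on all items and on unflagged items only. The record keys (`subject`, `correct`, `flagged`) and the toy data are hypothetical; they are not the schema of MMLU, MMLU-Redux, or any evaluation harness.

```python
from collections import defaultdict


def per_subject_report(records):
    """Summarize error rate and accuracy per subject.

    Each record is a dict with hypothetical keys:
      subject: str   benchmark subject name
      correct: bool  model answer matched the gold label
      flagged: bool  an audit marked the item as erroneous
    """
    stats = defaultdict(lambda: {"n": 0, "flagged": 0, "raw_correct": 0,
                                 "clean_n": 0, "clean_correct": 0})
    for r in records:
        s = stats[r["subject"]]
        s["n"] += 1
        s["raw_correct"] += int(r["correct"])
        if r["flagged"]:
            s["flagged"] += 1
        else:
            s["clean_n"] += 1
            s["clean_correct"] += int(r["correct"])

    report = {}
    for subject, s in stats.items():
        report[subject] = {
            "n": s["n"],
            "error_rate": s["flagged"] / s["n"],
            "raw_accuracy": s["raw_correct"] / s["n"],
            # None when every item in the subject was flagged.
            "clean_accuracy": (s["clean_correct"] / s["clean_n"]
                               if s["clean_n"] else None),
        }
    return report


if __name__ == "__main__":
    # Made-up records for illustration only; not real benchmark results.
    toy = [
        {"subject": "virology", "correct": True, "flagged": True},
        {"subject": "virology", "correct": False, "flagged": False},
        {"subject": "astronomy", "correct": True, "flagged": False},
        {"subject": "astronomy", "correct": True, "flagged": False},
    ]
    for subject, row in sorted(per_subject_report(toy).items()):
        print(subject, row)
```

A subject with a high `error_rate` and a large gap between `raw_accuracy` and `clean_accuracy` is a candidate for exclusion or manual review before its score is trusted.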
Journey Context:
A manual audit (the basis for MMLU-Redux) estimated that roughly 6.5% of MMLU questions contain errors, with 57% of the audited virology questions flawed. Some items have multiple correct answers, others carry a wrong gold label. Because LLMs can memorize systematic errors, a higher score can partly reflect fitting to noise rather than better knowledge. MMLU-Pro was introduced to demand more reasoning and is less saturated, but it still inherits the artifacts of the multiple-choice format. The lesson is that a broad multiple-choice benchmark is a screen, not a scorecard: inspect per-domain performance and question quality before trusting the result.
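To make the "fitting to noise" point concrete, here is a deliberately simplified calculation. It assumes every erroneous item has a wrong gold label (the audit also found other error types, so this is an upper-bound style illustration), and the function name, parameters, and the 0.90 clean accuracy are invented for the example.

```python
def measured_accuracy(clean_acc, error_rate, match_bad_label_rate):
    """Expected benchmark score under a simplified label-noise model.

    clean_acc: accuracy on correctly labeled items
    error_rate: fraction of items whose gold label is wrong (assumption)
    match_bad_label_rate: how often the model reproduces the wrong label
    """
    return clean_acc * (1.0 - error_rate) + match_bad_label_rate * error_rate


# Two hypothetical models with identical knowledge (90% on clean items).
# One never reproduces a wrong label; the other has memorized all of them.
honest = measured_accuracy(0.90, 0.065, 0.0)
memorizer = measured_accuracy(0.90, 0.065, 1.0)

print(f"honest:    {honest:.2f}")     # honest:    0.84
print(f"memorizer: {memorizer:.2f}")  # memorizer: 0.91
```

Under these assumptions the two models differ by the full error rate on the headline score even though their knowledge is the same, which is why a gap of a few points on raw MMLU is weak evidence on its own.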
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:29:21.827890+00:00— report_created — created