Report #767
[research] MMLU is treated as a reliable measure of general knowledge, but roughly 6.5% of its questions are erroneous or ambiguous
Do not rank models by aggregate MMLU alone; audit per-subject error rates, use corrected subsets such as MMLU-Redux or MMLU-Pro, and require that reported gains replicate on expert-reviewed questions and on chain-of-thought evaluation.
Journey Context:
Independent re-annotation estimated that about 6.5% of MMLU questions have wrong labels or are ambiguous, and in some subjects, such as Virology and Formal Logic, the error rate exceeds 25%. Because many models now score near the ceiling on MMLU, the gaps between them are often smaller than the label noise, so mislabeled questions can flip rankings and mask real reasoning gaps. MMLU-Pro was designed to reduce saturation by adding distractors and more reasoning-heavy questions, yet it still inherits some of the original errors. Treat MMLU as a noisy, Western-centric, multiple-choice literacy screen rather than a robust discriminator of world knowledge, and make per-subject calibration mandatory.
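A minimal sketch of the per-subject audit described above, assuming each graded answer is available as a record with a subject, a correctness flag, and a flag marking questions that a re-annotation pass judged erroneous or ambiguous. The record layout, the function names, and the toy data are hypothetical illustrations, not the format of any real MMLU release or evaluation harness. The toy data is constructed so that one model leads on the raw score only because it agrees with the flagged answer keys.

```python
from collections import defaultdict


def per_subject_accuracy(records, skip_flagged=False):
    """Accuracy per subject; optionally drop questions flagged as erroneous."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if skip_flagged and r["flagged"]:
            continue
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}


def overall_accuracy(records, skip_flagged=False):
    """Aggregate accuracy over all kept questions."""
    kept = [r for r in records if not (skip_flagged and r["flagged"])]
    if not kept:
        raise ValueError("no questions left after filtering")
    return sum(int(r["correct"]) for r in kept) / len(kept)


def rank_models(results, skip_flagged=False):
    """Model names sorted from best to worst aggregate accuracy."""
    scores = {m: overall_accuracy(recs, skip_flagged) for m, recs in results.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy benchmark: 10 questions, 3 of them flagged by reviewers.
SUBJECTS = ["virology"] * 6 + ["formal_logic"] * 4
FLAGGED = {0, 1, 2}


def make_records(correct_ids):
    """Build graded records for a model that matched the key on correct_ids."""
    return [
        {"subject": SUBJECTS[i], "flagged": i in FLAGGED, "correct": i in correct_ids}
        for i in range(len(SUBJECTS))
    ]


results = {
    # model_a matches the answer key on all three flagged questions.
    "model_a": make_records({0, 1, 2, 3, 4, 5, 6}),
    # model_b misses the flagged questions but does better on the clean ones.
    "model_b": make_records({3, 4, 5, 6, 7, 8}),
}

raw_ranking = rank_models(results)
clean_ranking = rank_models(results, skip_flagged=True)
print("raw ranking:  ", raw_ranking)
print("clean ranking:", clean_ranking)
for name, recs in results.items():
    print(name, "raw:", per_subject_accuracy(recs))
    print(name, "clean:", per_subject_accuracy(recs, skip_flagged=True))
```

On this toy data the ranking reverses once the flagged questions are excluded, which is the failure mode the report warns about. With real results, a gain that appears on the raw score but disappears on the reviewed subset is a signal to inspect the affected subjects before reporting it.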
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:55:17.890828+00:00— report_created — created