Report #409
[research] MMLU is treated as a reliable general-knowledge ranking
Use MMLU-Redux or MMLU-Pro for cleaner measurement, report the variance across prompts and evaluation runs alongside every score, and never use MMLU alone for production model selection. Prefer free-form answer evaluation over multiple-choice where possible, and cross-check with culturally neutral subsets.
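One way to act on the "report variance" advice is to score each model under several prompt templates and publish the spread next to the mean. The sketch below is illustrative only: the function names, the factor `k`, and the accuracy figures are all hypothetical, and the rule "gap must exceed k times the larger standard deviation" is a crude heuristic, not a formal significance test.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return mean, sample standard deviation, and range of accuracy scores."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),
        "range": max(scores) - min(scores),
    }


def gap_is_distinguishable(scores_a, scores_b, k=2.0):
    """Crude heuristic (an assumption, not a standard test): treat two models
    as separable only if the gap between their mean scores exceeds k times
    the larger of the two standard deviations."""
    a = summarize_runs(scores_a)
    b = summarize_runs(scores_b)
    gap = abs(a["mean"] - b["mean"])
    noise = max(a["stdev"], b["stdev"])
    return gap > k * noise


# Hypothetical accuracies (%) for two models, one score per prompt template.
model_a = [86.1, 84.0, 88.3, 85.2, 87.4]
model_b = [87.0, 85.5, 89.1, 84.8, 88.6]

if __name__ == "__main__":
    print("model_a:", summarize_runs(model_a))
    print("model_b:", summarize_runs(model_b))
    print("gap distinguishable from prompt noise:",
          gap_is_distinguishable(model_a, model_b))
```

With these made-up numbers the two means differ by less than one point while each model's scores spread over several points across templates, so the heuristic declines to rank them.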
Journey Context:
The 'Are We Done with MMLU?' audit manually re-annotated 5,700 MMLU questions and estimated that 6.49% contain errors, including 57% of the analysed Virology questions; these errors were large enough to change model rankings. MMLU-Pro addressed saturation by removing trivial and noisy questions and expanding from 4 to 10 answer choices, yet it remains a multiple-choice test vulnerable to answer-set exploitation and format-based guessing. IBM reproducibility work found 4-5% prompt sensitivity and 13 percentage points of variance across sources, dwarfing the ~1% gaps used to rank top models. The takeaway is that benchmark noise can exceed the differences between models, so MMLU should be one signal among many.
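To make the ranking-change point concrete, here is a toy sketch of how ground-truth label errors can flip a ranking: each model is scored once on every question and once on only the questions a re-annotation marked as clean. The question ids, labels, predictions, and the `clean_ids` set are all invented for illustration and do not come from MMLU or MMLU-Redux.

```python
def accuracy(predictions, gold, keep=None):
    """Fraction of questions answered correctly.

    predictions, gold: dicts mapping question id -> answer letter.
    keep: optional set of question ids to score; None scores all of them.
    """
    ids = list(gold) if keep is None else [q for q in gold if q in keep]
    if not ids:
        raise ValueError("no questions left to score")
    correct = sum(predictions[q] == gold[q] for q in ids)
    return correct / len(ids)


# Hypothetical benchmark labels; assume a re-annotation found that the
# recorded answers for q4 and q5 are wrong, leaving q1-q3 as the clean set.
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D", "q5": "A"}
clean_ids = {"q1", "q2", "q3"}

# Hypothetical model outputs. model_x happens to match the faulty labels.
model_x = {"q1": "A", "q2": "C", "q3": "A", "q4": "D", "q5": "A"}
model_y = {"q1": "A", "q2": "C", "q3": "B", "q4": "B", "q5": "C"}

if __name__ == "__main__":
    for name, preds in (("model_x", model_x), ("model_y", model_y)):
        print(name,
              "all questions:", accuracy(preds, gold),
              "clean subset:", round(accuracy(preds, gold, clean_ids), 3))
```

On the full set model_x scores higher, but on the clean subset model_y does, because model_x was being credited for agreeing with mislabeled answers. Real audits involve far more questions, so the effect is usually smaller than in this five-question toy, but the mechanism is the same.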
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:53:18.653820+00:00— report_created — created