Report #1034
[research] MMLU scores are noisy and often misleading because the benchmark contains erroneous questions and is prone to saturation and contamination.
Use a re-annotated or rebuilt successor such as MMLU-Redux (or MMLU-Pro / MMLU-CF) for model comparisons. For any custom multiple-choice eval, keep the test set closed, run exact-match 8-gram leakage checks against any available training or crawl data, and independently re-annotate a stratified sample.
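A minimal sketch of the last two checks, assuming whitespace tokenization and an in-memory corpus. The helper names (`ngrams`, `leaked_items`, `stratified_sample`) are hypothetical and not taken from any benchmark's tooling. A real pipeline would likely need a proper tokenizer and a hashed or on-disk index for large corpora.

```python
import random
from collections import defaultdict


def ngrams(text, n=8):
    """Return the set of whitespace-token n-grams in lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_items(test_items, corpus_docs, n=8):
    """Return indices of test items sharing at least one exact n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & corpus_grams]


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to per_stratum items from each stratum (e.g. subject) for re-annotation."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):
        group = groups[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Items shorter than eight tokens produce no 8-grams and are never flagged, so very short questions may need a smaller `n` or a separate check.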
Journey Context:
MMLU-Redux manually reviewed 5,700 questions across all 57 subjects and estimated that 6.49% of MMLU questions are erroneous, with the error rate reaching 57% in Virology; re-evaluation on the re-annotated data changed model rankings. Separately, TS-Guessing showed that GPT-4 predicts masked MMLU answer choices at 57% exact match, which suggests memorization of the test set. MMLU-Pro and MMLU-CF were introduced partly to address saturation and contamination. Do not treat published MMLU scores as ground truth; always check which variant was used and whether the test set was held out.
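A masked-choice probe in the spirit of TS-Guessing can be sketched as follows. This is an illustration under stated assumptions, not the original protocol: the prompt wording, the four-option `A`-`D` layout, the `mask_index` field, and the `predict` callable (any function that maps a prompt string to a model's text reply) are all hypothetical.

```python
def build_masked_prompt(question, choices, mask_index):
    """Render a multiple-choice item with one option replaced by [MASK]."""
    labels = "ABCD"  # assumes at most four options per item
    lines = [f"Question: {question}"]
    for i, choice in enumerate(choices):
        shown = "[MASK]" if i == mask_index else choice
        lines.append(f"{labels[i]}. {shown}")
    lines.append("Fill in [MASK] with the original option text only.")
    return "\n".join(lines)


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def exact_match_rate(items, predict):
    """Fraction of items where the model reproduces the masked option exactly."""
    if not items:
        return 0.0
    hits = 0
    for item in items:
        prompt = build_masked_prompt(
            item["question"], item["choices"], item["mask_index"]
        )
        target = item["choices"][item["mask_index"]]
        hits += normalize(predict(prompt)) == normalize(target)
    return hits / len(items)
```

A high exact-match rate on an option that cannot be inferred from the question alone is a signal worth investigating, not proof of contamination. Comparing against a freshly written control set helps calibrate what rate counts as suspicious.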
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.187400+00:00— report_created — created