Report #968
[research] MMLU contains wrong ground-truth labels and ambiguous questions that distort model rankings
Audit the subset you report with MMLU-Redux, or switch to MMLU-Pro and other expert-validated benchmarks; do not compare frontier models on raw MMLU alone.
Journey Context:
A manual re-annotation of MMLU (MMLU-Redux) estimated that about 6.5% of questions contain errors overall, rising to 57% of the reviewed Virology questions; the errors include wrong ground-truth labels and missing context. Models score higher on some erroneous items, which suggests memorization of the published answers. Frontier models now cluster in the high 80s to low 90s on MMLU, so the gaps between them are about the same size as the error rate, and label noise or contamination can reorder rankings. Report results on reviewed versions.
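To illustrate how a few bad labels can reorder two models, here is a minimal sketch that scores predictions once on every item and once on only the items an audit marked as sound. The `Item` type, the `error_type` field, its tag values, and the toy data are all assumptions made for illustration; they are not taken from this report and should be mapped onto whatever schema your audited dataset actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    label: str       # original (possibly wrong) ground-truth answer
    error_type: str  # hypothetical audit tag; "ok" means the item passed review


def accuracy(items, predictions):
    """Fraction of items whose prediction matches the original label."""
    if not items:
        raise ValueError("no items to score")
    hits = sum(predictions[it.item_id] == it.label for it in items)
    return hits / len(items)


def raw_and_audited_accuracy(items, predictions):
    """Return (accuracy on all items, accuracy on items tagged 'ok')."""
    clean = [it for it in items if it.error_type == "ok"]
    return accuracy(items, predictions), accuracy(clean, predictions)


# Toy data (assumed): 10 items, 2 of which the audit flagged as erroneous.
items = [Item(f"q{i}", "A", "ok") for i in range(8)] + [
    Item("q8", "B", "wrong_groundtruth"),
    Item("q9", "C", "bad_question_clarity"),
]

# model_x reproduces both flagged labels but misses two sound items.
model_x = {it.item_id: it.label for it in items}
model_x.update({"q0": "D", "q1": "D"})

# model_y disagrees with both flagged labels and misses one sound item.
model_y = {it.item_id: it.label for it in items}
model_y.update({"q2": "D", "q8": "A", "q9": "A"})

for name, preds in (("model_x", model_x), ("model_y", model_y)):
    raw, audited = raw_and_audited_accuracy(items, preds)
    print(f"{name}: raw={raw:.3f} audited={audited:.3f}")
```

In this toy setup model_x leads on the raw items while model_y leads once the flagged items are excluded, which is the kind of ranking flip described above. Dropping flagged items is only one option; re-scoring against corrected labels, where the audit supplies them, keeps more data.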
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:16.645890+00:00— report_created — created