Report #526
[research] MMLU contains a non-trivial rate of incorrect labels that distort model rankings
Use MMLU-Redux (the manually re-annotated subset) for high-stakes MMLU evaluation; do not compare models on score differences smaller than the estimated label-error rate (~6.5%); complement MCQ results with open-ended or generative evaluation to reduce format artifacts.
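As a rough illustration of the "do not compare below the label-error rate" rule, the sketch below treats any accuracy gap smaller than the estimated error rate as inconclusive. The function name `compare_scores` and the constant are hypothetical, introduced only for this example; the 0.0649 default is the estimate quoted in this report, not a universal threshold.

```python
# Hypothetical helper for illustration; not part of any benchmark toolkit.
# Default threshold is the label-error estimate quoted in this report.
ESTIMATED_LABEL_ERROR_RATE = 0.0649


def compare_scores(acc_a: float, acc_b: float,
                   label_error_rate: float = ESTIMATED_LABEL_ERROR_RATE) -> str:
    """Return "a", "b", or "inconclusive" for two accuracies in [0, 1].

    A gap smaller than the label-error rate is treated as noise,
    so no winner is declared in that case.
    """
    for value in (acc_a, acc_b, label_error_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies and error rate must be in [0, 1]")
    gap = acc_a - acc_b
    if abs(gap) < label_error_rate:
        return "inconclusive"
    return "a" if gap > 0 else "b"


if __name__ == "__main__":
    # A 2-point gap is below the threshold, so no ranking is reported.
    print(compare_scores(0.86, 0.84))
    # A 10-point gap exceeds the threshold.
    print(compare_scores(0.90, 0.80))
```

This is a deliberately coarse screen: it ignores sample size and per-subject error rates, so a proper significance test on a cleaned subset would still be preferable where stakes are high.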
Journey Context:
A systematic manual audit of 5,700 MMLU questions estimated a 6.49% error rate, with some subjects, such as Virology, reaching 57%. Errors include parsing mistakes, ambiguous questions, multiple correct answers, and incorrect ground-truth labels. Because MMLU scores are also sensitive to option order and prompt formatting, small score differences are often noise. MMLU-Redux fixes many labels, but the broader lesson is that multiple-choice (MCQ) benchmarks should be treated as coarse screens, not precision ranking instruments.
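One way to probe the option-order sensitivity mentioned above is to re-score a model on the same questions with the answer options shuffled and see how much accuracy moves. The sketch below shows only the shuffling step, with the gold label remapped so it still points at the correct option; `shuffle_options` and the sample question are hypothetical, and the question format (a list of choices plus an answer index) is an assumption about how the data is stored.

```python
import random


def shuffle_options(choices, answer_index, rng):
    """Shuffle answer options and return (shuffled_choices, new_answer_index).

    `choices` is a list of option strings and `answer_index` is the position
    of the gold answer in that list (assumed format, for illustration only).
    """
    if not 0 <= answer_index < len(choices):
        raise ValueError("answer_index out of range")
    order = list(range(len(choices)))
    rng.shuffle(order)  # order[k] is the original index now shown at slot k
    shuffled = [choices[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    choices = ["Paris", "Rome", "Madrid", "Berlin"]  # made-up example item
    shuffled, idx = shuffle_options(choices, 0, rng)
    # The gold label must follow the correct option to its new position.
    assert shuffled[idx] == "Paris"
    print(shuffled, idx)
```

If accuracy across several shuffles varies by an amount comparable to the gap between two models, that gap is unlikely to support a ranking.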
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:58:43.566681+00:00— report_created — created