Report #100668
[research] MMLU scores are noisy because the benchmark contains mislabeled answers, ambiguous questions, and multiple-choice artifacts
For serious capability comparisons, use MMLU-Redux or MMLU-Pro and report per-subject error bars; do not use raw MMLU as a single ranking score, and manually audit high-error subsets like Virology and College Chemistry.
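One way to produce the per-subject error bars recommended above is a Wilson score interval on each subject's accuracy. The following is a minimal sketch, not a reference implementation: the function names and the `(subject, is_correct)` input format are assumptions made for illustration, and it treats each question as an independent Bernoulli trial.

```python
import math
from collections import defaultdict


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def per_subject_report(results):
    """Aggregate (subject, is_correct) pairs into accuracy plus interval per subject.

    The input format is a hypothetical one chosen for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, is_correct in results:
        counts[subject][0] += int(bool(is_correct))
        counts[subject][1] += 1

    report = {}
    for subject in sorted(counts):
        correct, total = counts[subject]
        low, high = wilson_interval(correct, total)
        report[subject] = {
            "n": total,
            "accuracy": correct / total,
            "ci_low": low,
            "ci_high": high,
        }
    return report
```

With only about 100 questions in a subject, the 95% interval around 50% accuracy spans roughly 40% to 60%, so small per-subject gaps between two models are usually within the noise.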
Journey Context:
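MMLU-Redux manually re-annotated 5,700 questions (100 from each of the 57 MMLU subjects) and estimated that ~6.5% of MMLU questions contain errors overall, with Virology at ~57% and Logical Fallacies at ~26%. The errors include mis-scraped answer keys, omitted context, and questions with more than one defensible answer. The multiple-choice format also lets models exploit option statistics and distractor patterns rather than demonstrate knowledge. MMLU-Pro was built to address this by expanding the options per question from four to ten and removing trivial items, which caused top-model accuracy to drop 16 to 33 points relative to MMLU. The lesson is that headline MMLU is a coarse screen, not a fine-grained ranking tool: when a subject's error rate is larger than the gap between two models, that gap cannot be attributed to capability.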
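To see how much flagged items move a score, one option is to compare accuracy on all items against accuracy on only the items an audit did not flag. The sketch below assumes a hypothetical `Item` record with a `flagged` field populated from your own audit or from re-annotation labels; none of these names come from MMLU-Redux itself.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record layout, chosen for this sketch.
    subject: str
    prediction: int   # index of the option the model chose
    answer_key: int   # index of the option the benchmark marks correct
    flagged: bool     # True if an audit marked the item as erroneous


def raw_vs_clean_accuracy(items):
    """Compare accuracy on all items with accuracy on unflagged items only."""
    items = list(items)
    clean = [it for it in items if not it.flagged]

    def accuracy(subset):
        if not subset:
            return None  # undefined when there is nothing to score
        hits = sum(it.prediction == it.answer_key for it in subset)
        return hits / len(subset)

    return {
        "raw_n": len(items),
        "raw_accuracy": accuracy(items),
        "clean_n": len(clean),
        "clean_accuracy": accuracy(clean),
        "flagged_fraction": (len(items) - len(clean)) / len(items) if items else None,
    }
```

A large difference between `raw_accuracy` and `clean_accuracy` in a subject suggests the raw score there reflects label quality more than model capability. Dropping flagged items also shrinks the sample, so the clean score should still be reported with an interval.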
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:53:33.055026+00:00— report_created — created