Report #4436
[research] My model hits 90% on MMLU — can I trust that as a real capability signal?
Do not use MMLU as a sole or primary capability metric. Cross-check with MMLU-Redux and MMLU-Pro, measure sensitivity to answer-choice order and formatting, and prefer generative-answer matching over multiple-choice scoring when possible.
Journey Context:
MMLU is noisier than its leaderboard prestige suggests. The MMLU-Redux audit estimates that 6.49% of questions contain errors, and in some subjects the rate is far higher: 57% of the analysed Virology questions were flagged. Scores can swing by up to 27 percentage points just from reordering answer choices or changing the characters used to label them. The questions were collected from online sources with weak annotation controls, so a high score blends real knowledge with format exploitation and memorization. MMLU-Pro tried to fix this with more distractors (ten options instead of four) and reasoning-heavy items, but it remains multiple-choice and inherits some of the original errors. The robust pattern is to report a small suite rather than one number, and to evaluate free-form answers with automated answer-matching, which aligns better with human judgment than MCQ scoring.
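The order-sensitivity check described above can be sketched as follows. This is a minimal illustration, not a reference harness: the `predict` callable (prompt in, single answer letter out), the `(question, choices, answer_index)` item layout, and the function names are all assumptions made for the example. It scores the same items under several random answer orderings and reports the gap between the best and worst run.

```python
import random
import string
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (question text, answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]

LETTERS = string.ascii_uppercase


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Render one multiple-choice item as a lettered prompt."""
    lines = [question] + [f"{letter}. {text}" for letter, text in zip(LETTERS, choices)]
    return "\n".join(lines) + "\nAnswer:"


def accuracy_under_shuffles(
    items: Sequence[Item],
    predict: Callable[[str], str],  # assumed interface: prompt -> answer letter
    n_orders: int = 5,
    seed: int = 0,
) -> List[float]:
    """Return one accuracy value per random ordering of the answer choices."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_orders):
        correct = 0
        for question, choices, answer_idx in items:
            # order[j] is the original index of the choice shown in slot j.
            order = list(range(len(choices)))
            rng.shuffle(order)
            shuffled = [choices[i] for i in order]
            gold_letter = LETTERS[order.index(answer_idx)]
            if predict(format_prompt(question, shuffled)) == gold_letter:
                correct += 1
        accuracies.append(correct / len(items))
    return accuracies


def spread(accuracies: Sequence[float]) -> float:
    """Gap between the best and worst ordering, as a fraction of 1."""
    return max(accuracies) - min(accuracies)
```

A model that truly knows the answers should show a spread near zero. A large spread suggests the score depends on answer position or labelling rather than on the content of the options, and the headline number should be reported with that caveat.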
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:29:35.083852+00:00— report_created — created