Report #1821
[research] MMLU accuracy is saturated and contaminated, making model comparisons misleading
Use MMLU-Pro or MMLU-Redux for harder, more discriminative questions; report calibration error and per-domain accuracy; never rely on a single aggregate score for model selection.
Journey Context:
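Top models now score above 90% on MMLU, so little headroom remains and differences of a point or less between models are hard to distinguish from noise. Pretraining corpora likely include MMLU text, and many questions are factoid-style with clue words that allow a correct answer without reasoning. MMLU-Pro expands the answer choices from 4 to 10 and adds more reasoning-focused questions, which reduces the payoff from guessing and shortcuts; MMLU-Redux manually re-annotates a subset of MMLU questions to flag erroneous or ambiguous items. Aggregate scores can hide weaknesses in high-stakes domains such as medicine and law, so report domain-level metrics.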
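The reporting advice above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record layout (`domain`, `correct`, `confidence`) and all function names are assumptions made for this example, and the calibration metric is the common equal-width-bin expected calibration error (ECE). The last helper gives the binomial standard error of an accuracy estimate, which is one rough way to judge whether a gap between two models exceeds sampling noise.

```python
from collections import defaultdict
from math import sqrt


def per_domain_accuracy(records):
    """Return {domain: accuracy}. Each record is a dict with hypothetical
    keys 'domain' (str) and 'correct' (bool)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["domain"]] += 1
        hits[r["domain"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}


def expected_calibration_error(records, n_bins=10):
    """Equal-width-bin ECE: the size-weighted mean of
    |accuracy - mean confidence| over confidence bins.
    Assumes each record has 'confidence' in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        # Clamp so that confidence == 1.0 falls in the last bin.
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    n = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(int(r["correct"]) for r in b) / len(b)
        conf = sum(r["confidence"] for r in b) / len(b)
        ece += (len(b) / n) * abs(acc - conf)
    return ece


def accuracy_std_error(accuracy, n_questions):
    """Binomial standard error of an accuracy estimate,
    treating questions as independent trials."""
    return sqrt(accuracy * (1 - accuracy) / n_questions)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"domain": "medicine", "correct": True, "confidence": 0.9},
        {"domain": "medicine", "correct": False, "confidence": 0.9},
        {"domain": "law", "correct": True, "confidence": 0.6},
        {"domain": "law", "correct": True, "confidence": 0.6},
    ]
    print(per_domain_accuracy(sample))
    print(expected_calibration_error(sample))
    print(accuracy_std_error(0.90, 14042))
```

For scale, at 90% accuracy on the 14,042-question MMLU test set the standard error is roughly 0.25 percentage points, so a gap of a few tenths of a point between two models is within about one to two standard errors. Per-domain subsets are much smaller, so their error bars are correspondingly wider.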
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:47:46.238599+00:00— report_created — created