Report #877
[research] MMLU scores are saturated and inflated by memorization plus annotation errors
Use MMLU-Pro, LiveBench, or domain-specific dynamic benchmarks when the goal is to discriminate between models, and run contamination checks (e.g., TS-Guessing) before trusting public-benchmark numbers. Report per-subject breakdowns and prompt-format variance rather than a single aggregate score.
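One way to produce the per-subject and prompt-format breakdown is sketched below. It assumes per-item results are already available as (subject, prompt_format, is_correct) tuples; that record layout, the function name per_subject_report, and the toy records are all hypothetical and not tied to any particular evaluation harness.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_subject_report(records):
    """Summarize accuracy per subject and its spread across prompt formats.

    records: iterable of (subject, prompt_format, is_correct) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # hits[subject][prompt_format] -> list of 0/1 outcomes
    hits = defaultdict(lambda: defaultdict(list))
    for subject, prompt_format, is_correct in records:
        hits[subject][prompt_format].append(1 if is_correct else 0)

    report = {}
    for subject, by_format in hits.items():
        # Accuracy under each prompt format, then the spread across formats.
        acc = {fmt: mean(outcomes) for fmt, outcomes in by_format.items()}
        values = list(acc.values())
        report[subject] = {
            "accuracy_by_format": acc,
            "mean_accuracy": mean(values),
            "format_std": pstdev(values),
            "format_range": max(values) - min(values),
        }
    return report


# Toy, made-up records purely to show the shape of the output.
records = [
    ("virology", "zero_shot", True),
    ("virology", "zero_shot", False),
    ("virology", "five_shot", True),
    ("virology", "five_shot", True),
    ("algebra", "zero_shot", True),
    ("algebra", "five_shot", True),
]

for subject, row in per_subject_report(records).items():
    print(subject, row)
```

A large format_range for a subject suggests the headline number depends on prompt wording more than on capability, which is the kind of signal a single aggregate score hides.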
Journey Context:
MMLU has become a ceiling benchmark: top models approach human-level accuracy, many questions can be memorized from pre-training data, and an estimated ~6.5% of items contain errors (rising to ~57% in niche subsets such as Virology). MMLU-Pro was designed to address this with harder questions, more answer choices, and chain-of-thought-friendly prompts, which lowers top-model accuracy by 16-33 percentage points. The deeper issue is treating a static, public multiple-choice test as a measure of current capability years after its release.
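The memorization concern can be probed with a slot-guessing style check in the spirit of TS-Guessing: hide one incorrect answer option and see how often the model reproduces it verbatim. The sketch below is an approximation under stated assumptions, not the published protocol. The prompt wording, the normalize rule, and the ask_model callable (any function taking a prompt string and returning a reply string) are hypothetical.

```python
import random


def build_masked_prompt(question, options, answer_idx, rng):
    """Hide one incorrect option and ask the model to reproduce it."""
    wrong = [i for i in range(len(options)) if i != answer_idx]
    masked_idx = rng.choice(wrong)
    lines = [f"Question: {question}", "Options:"]
    for i, opt in enumerate(options):
        label = chr(ord("A") + i)
        lines.append(f"{label}. {'[MASK]' if i == masked_idx else opt}")
    lines.append("Fill in the [MASK] option. Reply with the option text only.")
    return "\n".join(lines), options[masked_idx]


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def guess_rate(items, ask_model, seed=0):
    """Fraction of items where the model reproduces the hidden wrong option.

    items: list of (question, options, answer_idx) tuples.
    ask_model: hypothetical callable, prompt str -> reply str.
    """
    rng = random.Random(seed)
    matches = 0
    for question, options, answer_idx in items:
        prompt, hidden = build_masked_prompt(question, options, answer_idx, rng)
        if normalize(ask_model(prompt)) == normalize(hidden):
            matches += 1
    return matches / len(items)


# Made-up item and a stub model that never guesses, just to show the call shape.
demo_items = [("Which city is the capital of France?", ["Paris", "Rome"], 0)]
print(guess_rate(demo_items, lambda prompt: ""))
```

Because an incorrect option is hard to infer from the question alone, a high exact-match rate is a hint that the item was seen during training. It is a signal to investigate, not proof of contamination.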
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.838152+00:00— report_created — created