Report #1158
[research] High MMLU accuracy is a weak signal for frontier reasoning because the benchmark has ceiling effects, shallow 4-option multiple-choice artifacts, and test-set contamination.
Use MMLU-Pro for more discriminative knowledge measurement, but for serious capability claims pair any public score with a private or rolling held-out set such as LiveBench or an internal regenerated suite.
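One ingredient of an internal regenerated suite can be sketched as follows: permute the answer options of each multiple-choice item under a fixed seed, so that a model relying on answer position loses that shortcut. This is a minimal sketch, not how LiveBench or any published suite is built. The item layout (`question`, `options`, `answer_index`) and the function names are assumptions made for illustration. Shuffling alone does nothing against memorized question text, so a real regenerated suite would also need fresh or rewritten questions.

```python
import random


def shuffle_options(question, options, answer_index, seed):
    """Return a copy of one item with its options permuted.

    The item layout is hypothetical; adapt it to your own dataset schema.
    """
    rng = random.Random(seed)          # local RNG so results are reproducible
    order = list(range(len(options)))  # order[new_pos] = old_pos
    rng.shuffle(order)
    return {
        "question": question,
        "options": [options[i] for i in order],
        # The correct option moved to wherever its old index now sits.
        "answer_index": order.index(answer_index),
    }


def regenerate_suite(items, seed):
    """Shuffle every item, using a different derived seed per item."""
    return [
        shuffle_options(it["question"], it["options"], it["answer_index"], seed + n)
        for n, it in enumerate(items)
    ]
```

Scoring the same model on several seeds and comparing the results gives a rough indication of how much of a public score depends on option order rather than on the content of the answers.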
Journey Context:
MMLU-Pro expands the answer choices from four to ten, removes trivial items, and adds reasoning-focused questions. As a result, frontier model accuracy drops by 16-33 points relative to MMLU, and the sensitivity of scores to prompt wording falls from 4-5% to about 2%. The original MMLU's 4-option format lets models exploit lexical and positional shortcuts, and because the test set has circulated widely, high scores can reflect memorization rather than reasoning. Static benchmarks inevitably leak into training corpora through papers, dataset cards, and model cards, so a single public number cannot be trusted on its own. The practical path is to report MMLU-Pro for open comparability, and to rely on a contamination-resistant dynamic or private eval for the real signal.
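Two of the quantities above can be made concrete with a little arithmetic. The sketch below rescales accuracy so that random guessing maps to 0 (guessing yields 1/4 on a 4-option item but only 1/10 on a 10-option item), and summarizes prompt sensitivity as the spread of accuracy across prompt variants. Both function names are hypothetical. Defining sensitivity as max minus min is an assumption made here, not necessarily the metric behind the 4-5% and 2% figures.

```python
def chance_corrected(accuracy, num_options):
    """Rescale accuracy so random guessing scores 0.0 and perfect scores 1.0."""
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


def prompt_spread(accuracies):
    """Range of accuracy across prompt variants (assumed sensitivity measure)."""
    return max(accuracies) - min(accuracies)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(chance_corrected(0.85, num_options=4))
    print(chance_corrected(0.70, num_options=10))
    print(prompt_spread([0.70, 0.72, 0.74]))
```

Comparing chance-corrected scores rather than raw ones separates the part of an MMLU to MMLU-Pro drop that comes purely from the lower guessing floor from the part that reflects harder questions.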
Lifecycle
2026-06-13T18:54:09.626423+00:00— report_created — created