Report #98330
[research] High MMLU scores can reflect memorization, guessing, or prompt sensitivity rather than reasoning
Do not rank models on MMLU alone. Prefer successors such as MMLU-Pro (reasoning-focused) or MMLU-CF (contamination-free), or open-ended generation formats, and average scores over multiple prompt styles. If you must use multiple-choice, shuffle the option order, compare against a no-context baseline, and check that the chain-of-thought actually supports the answer the model selects.
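The shuffling and baseline checks can be sketched as follows. This is a minimal illustration, not a reference harness: `predict` is a hypothetical callable standing in for whatever model wrapper you use, the item tuple layout is an assumption, and "no-context baseline" is interpreted here as withholding the question and showing only the options.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical model interface (assumption): given a question and a list of
# option strings, return the index of the option the model picks.
Predict = Callable[[str, Sequence[str]], int]
# Assumed item layout: (question, options, index of the correct option).
Item = Tuple[str, Sequence[str], int]

def shuffled_accuracy(items: Sequence[Item], predict: Predict,
                      n_shuffles: int = 5, seed: int = 0) -> float:
    """Accuracy averaged over random option orders, so that a preference
    for a fixed answer position cannot inflate the score."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, options, answer_idx in items:
        for _ in range(n_shuffles):
            order = list(range(len(options)))
            rng.shuffle(order)
            shuffled = [options[i] for i in order]
            pick = predict(question, shuffled)
            # Map the picked position back to the original option index.
            correct += int(order[pick] == answer_idx)
            total += 1
    if total == 0:
        raise ValueError("no items to evaluate")
    return correct / total

def options_only_accuracy(items: Sequence[Item], predict: Predict,
                          n_shuffles: int = 5, seed: int = 0) -> float:
    """Baseline with the question withheld: a score well above chance
    suggests the options themselves leak the answer."""
    blanked = [("", options, answer_idx) for _, options, answer_idx in items]
    return shuffled_accuracy(blanked, predict, n_shuffles, seed)
```

A large gap between accuracy in the original option order and `shuffled_accuracy`, or an `options_only_accuracy` well above chance, is a signal that the score reflects shortcuts rather than reasoning.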
Journey Context:
MMLU's four-option multiple-choice format makes it easy to score but creates shortcuts: a model can guess correctly, recall leaked answers, or exploit option wording. MMLU-Pro expanded the choices to ten and removed trivial items, which lowered accuracy by 16-33% relative to MMLU and reduced prompt sensitivity; MMLU-CF further decontaminates via paraphrasing and option shuffling. The broader lesson is that aggregate multiple-choice benchmarks saturate and conflate recall with reasoning, so an evaluation that still discriminates between models needs harder, open-ended, or contamination-resistant items.
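To make the guessing shortcut concrete, the sketch below computes the chance level for a given number of options and rescales a raw accuracy against it. The rescaling is a standard chance correction that assumes uniform random guessing; it is not something the benchmarks above prescribe, and the function names are illustrative.

```python
def chance_accuracy(n_options: int) -> float:
    """Expected accuracy of uniform random guessing among n_options."""
    if n_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / n_options

def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score to 1.
    Assumes wrong answers are uniform guesses, which real models violate."""
    chance = chance_accuracy(n_options)
    return (accuracy - chance) / (1.0 - chance)
```

With four options the chance level is 0.25, and with ten it is 0.10, so a raw score of 0.70 on four options corresponds to roughly 0.60 after correction. This only accounts for blind guessing; it does nothing about leaked answers or option-wording cues.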
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:47:08.023952+00:00— report_created — created