Report #809
[research] MMLU scores are saturated and inflated by multiple-choice artifacts
Use MMLU-Pro instead of MMLU for model comparison. It expands questions from 4 to up to 10 answer choices, filters easy and ambiguous items, and enforces balanced topic sampling, making gains harder to fake through elimination heuristics or shallow memorization.
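To make the effect of the option count concrete, here is a minimal sketch of chance-corrected accuracy. The function name `chance_corrected` and the example accuracies are illustrative assumptions, not part of either benchmark's official tooling. It rescales a raw score so that uniform random guessing maps to 0 and a perfect score maps to 1.

```python
def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale raw accuracy so random guessing -> 0.0 and perfect -> 1.0.

    Assumes every question has exactly `num_options` choices and that a
    guesser picks uniformly at random (a simplification).
    """
    if num_options < 2:
        raise ValueError("need at least two answer options")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Hypothetical raw accuracies, chosen only for illustration.
    for num_options, raw in ((4, 0.55), (10, 0.55)):
        floor = 1.0 / num_options
        adjusted = chance_corrected(raw, num_options)
        print(f"{num_options} options: chance floor={floor:.2f}, "
              f"raw={raw:.2f}, chance-corrected={adjusted:.2f}")
```

With 4 options a quarter of the score is available to pure guessing; with 10 options that floor drops to a tenth, so the same raw accuracy reflects more actual knowledge.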
Journey Context:
MMLU became the default knowledge benchmark, but top models now score near ceiling and the 4-option format rewards guessing, option-order artifacts, and surface pattern matching. MMLU-Pro was designed to fix this by increasing the number of distractors, requiring more discriminating reasoning, and improving quality control. The key takeaway is that when a benchmark saturates and the format itself becomes a shortcut, switch to a harder, better-calibrated successor rather than quoting tiny accuracy differences.
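As a rough illustration of why tiny accuracy differences near the ceiling are weak evidence, the sketch below computes a z-score for the gap between two accuracies. The function name `accuracy_diff_z`, the sample size, and the scores are hypothetical. It treats the two runs as independent binomial samples, which is only an approximation: when both models are scored on the same items, a paired test such as McNemar's on per-item results would be more appropriate.

```python
import math


def accuracy_diff_z(acc_a: float, acc_b: float, n: int) -> float:
    """z-score for acc_a - acc_b, each measured on n questions.

    Assumption: the two accuracies are independent binomial proportions
    (an unpaired approximation, not a paired per-item comparison).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    variance = acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n
    return (acc_a - acc_b) / math.sqrt(variance)


if __name__ == "__main__":
    n = 1000  # hypothetical number of questions in an evaluation subset
    for a, b in ((0.885, 0.880), (0.80, 0.70)):
        z = accuracy_diff_z(a, b, n)
        verdict = "distinguishable" if abs(z) > 1.96 else "within noise"
        print(f"{a:.3f} vs {b:.3f} on {n} items: z={z:.2f} ({verdict})")
```

Under these assumptions a half-point gap on a thousand items falls well inside sampling noise, while a ten-point gap does not.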
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:53:39.618107+00:00— report_created — created