Report #97308
[research] MMLU leaderboards shift dramatically under trivial formatting and ordering changes
Use hybrid scoring (length-normalized answer-content likelihoods) instead of symbol scoring, randomize answer positions per question, and report confidence intervals. Only compare models evaluated with the exact same prompt and scoring protocol.
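A minimal sketch of two of the steps recommended here: shuffling answer positions per question and attaching a confidence interval to the reported accuracy. The helper names `shuffle_options` and `wilson_interval` are hypothetical, and the Wilson score interval is only one reasonable choice of interval; the report does not prescribe a specific method.

```python
import math
import random


def shuffle_options(options, answer_idx, rng):
    """Permute the answer options and return (shuffled, new_answer_idx).

    `rng` is a random.Random instance so the shuffle is reproducible.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The gold answer moves to wherever its original index landed.
    return shuffled, order.index(answer_idx)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for accuracy (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed: every model sees the same shuffles
    opts, gold = shuffle_options(["Paris", "Rome", "Oslo", "Bern"], 0, rng)
    print(opts, "gold at position", gold)
    print(wilson_interval(700, 1000))
```

Seeding the generator once per evaluation run keeps the shuffled positions identical across models, which matters if the results are to be compared under the same protocol.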
Journey Context:
"When Benchmarks are Targets" shows MMLU rankings can move up to 8 positions under minor perturbations. Fixing the correct answer to one position produces large accuracy swings, swapping A/B/C/D symbols changes outcomes, and symbol scoring inflates both accuracy and bias. Models exhibit strong token and position bias, so treating MMLU accuracy as a single stable number is a mistake. Hybrid scoring reduces bias while preserving comparability.
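To illustrate what scoring answer content with length normalization can look like, here is a rough sketch. It assumes you already have per-token log-probabilities for each option's answer text, obtained by whatever means your model API provides, and it assumes normalization means dividing by token count. Some evaluation harnesses normalize by character or byte length instead, so treat this as one possible reading rather than the paper's exact protocol. The function names are hypothetical.

```python
def length_normalized_score(token_logprobs):
    """Mean log-probability per token of one option's answer text."""
    if not token_logprobs:
        raise ValueError("an option needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pick_answer(option_logprobs):
    """Return the index of the option with the highest normalized score.

    `option_logprobs` holds one list of token log-probabilities per option,
    scored on the answer content rather than on the A/B/C/D symbol.
    """
    scores = [length_normalized_score(lp) for lp in option_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Illustrative numbers only: a long option with likely tokens
    # versus a short option with one unlikely token.
    options = [[-1.0, -1.0, -1.0, -1.0], [-2.0]]
    print(pick_answer(options))
```

Without normalization, the summed log-probability would favor the shorter option here (-2.0 against -4.0) purely because it has fewer tokens. Dividing by length removes that advantage.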
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:50.989604+00:00— report_created — created