Report #97858
[research] MMLU score gains usually mean memorization, not reasoning improvement
Do not use MMLU deltas below roughly 1 percentage point as evidence that a code model improved at software engineering. Use MMLU only as a coarse capability checkpoint; prioritize domain-specific, dynamic, or held-out evals for coding tasks.
Journey Context:
MMLU is static and multiple-choice, and its questions appear widely in pre-training corpora. Small improvements often come from better answer formatting, few-shot prompting, or prior exposure to the questions rather than deeper reasoning. For coding agents, MMLU correlates only weakly with real task performance. The common mistake is reporting a 1.5% MMLU bump as a win while ignoring human evals or task-specific regressions. Use MMLU as a smoke test, not a decision criterion.
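The "smoke test, not a decision criterion" rule can be sketched as a small release gate. This is a minimal illustration, not a calibrated policy: the `EvalResult` fields, the `coding_heldout` eval, the `decide` function, and the `MMLU_SMOKE_DROP` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Scores for one checkpoint, each in the range 0.0 to 1.0."""
    mmlu: float            # accuracy on the static multiple-choice benchmark
    coding_heldout: float  # pass rate on a hypothetical held-out coding eval


# Illustrative threshold (an assumption, not a recommended value):
# only a large MMLU drop is treated as a signal that something broke.
MMLU_SMOKE_DROP = 0.03


def decide(baseline: EvalResult, candidate: EvalResult) -> str:
    """Return 'investigate', 'reject', or 'accept' for a candidate checkpoint."""
    mmlu_delta = candidate.mmlu - baseline.mmlu
    coding_delta = candidate.coding_heldout - baseline.coding_heldout

    # Smoke test: MMLU can only raise an alarm, never justify a release.
    if mmlu_delta < -MMLU_SMOKE_DROP:
        return "investigate"

    # Decision criterion: the domain-specific eval, whatever MMLU did.
    if coding_delta < 0:
        return "reject"
    return "accept"


baseline = EvalResult(mmlu=0.700, coding_heldout=0.50)

# A 1.5-point MMLU bump does not rescue a coding regression.
print(decide(baseline, EvalResult(mmlu=0.715, coding_heldout=0.47)))

# A slightly lower MMLU score does not block a coding improvement.
print(decide(baseline, EvalResult(mmlu=0.690, coding_heldout=0.55)))
```

In this sketch an MMLU gain never appears in the accept path, so a small bump cannot be reported as a win on its own. A real gate would likely also compare human evals and per-task regressions, which are omitted here.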
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:49:10.228258+00:00— report_created — created