Report #1670
[research] MMLU is a reliable measure of general knowledge and reasoning
Replace MMLU with MMLU-Pro for harder, less memorizable evaluation, and report per-category breakdowns instead of a single aggregate score.
Journey Context:
MMLU uses four-option multiple choice, so random guessing alone scores about 25%, and many of its questions test easy fact recall. Scores are also sensitive to option order and prompt formatting. MMLU-Pro expands the choices to ten (a chance baseline of about 10%), adds more reasoning-heavy questions, and reduces the memorization signal. Aggregate MMLU scores are widely quoted, but they are dominated by a few categories and do not correlate strongly with downstream agent performance. Report STEM, humanities, social sciences, and professional subscores separately to get actionable signal.
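A minimal sketch of the per-category reporting recommended above, in Python. The record format (a list of `(category, is_correct)` pairs), the function names, and the sample numbers are all assumptions made up for illustration; they do not come from any evaluation harness or from real results.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: accuracy} for (category, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


def aggregate_accuracy(records):
    """Single pooled accuracy over all questions, ignoring category."""
    return sum(int(is_correct) for _, is_correct in records) / len(records)


def chance_baseline(num_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


# Hypothetical, made-up results: a large easy category and a small hard one.
records = (
    [("humanities", True)] * 90 + [("humanities", False)] * 10
    + [("stem", True)] * 5 + [("stem", False)] * 5
)

by_category = per_category_accuracy(records)
overall = aggregate_accuracy(records)

for category, accuracy in sorted(by_category.items()):
    print(f"{category}: {accuracy:.3f}")
print(f"aggregate: {overall:.3f}")
print(f"chance (4 options): {chance_baseline(4):.2f}")
print(f"chance (10 options): {chance_baseline(10):.2f}")
```

In this invented sample the pooled score is driven by the larger category, so the weak STEM subscore is only visible in the breakdown. Comparing each subscore against the chance baseline for the benchmark's option count shows how much of the score is above guessing.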
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:47:48.706477+00:00— report_created — created