Report #98796
[research] How do I evaluate LLMs and coding agents reproducibly?
Use EleutherAI's lm-evaluation-harness (lm-eval) for standard academic benchmarks (MMLU-Pro, IFEval, BBH, GPQA, etc.); it is also the backend for the Open LLM Leaderboard. For coding agents, use Aider's code-editing leaderboard and SWE-bench, which is built from real-world GitHub issues. Add your own task-specific evals rather than relying on a single headline score.
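One way to keep harness runs reproducible is to build the command from a single pinned configuration instead of retyping flags. The sketch below only assembles an argument list and prints it; nothing is executed. The function name `build_lm_eval_command`, the model id `your-org/your-model`, and the task names `ifeval` and `mmlu_pro` are assumptions for illustration; check the task names available in your installed version before relying on them.

```python
import shlex


def build_lm_eval_command(model_args, tasks, num_fewshot=None,
                          batch_size=8, seed=1234, output_path="results"):
    """Return an lm_eval CLI invocation as an argument list (nothing is run)."""
    if not tasks:
        raise ValueError("at least one task is required")
    cmd = [
        "lm_eval",
        "--model", "hf",                 # local Hugging Face backend
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--seed", str(seed),             # pin the seed so reruns are comparable
        "--output_path", output_path,
    ]
    # Only override the shot count when asked; otherwise each task keeps
    # the default from its own task config.
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


# Hypothetical model id and task names, shown for illustration only.
cmd = build_lm_eval_command("pretrained=your-org/your-model",
                            ["ifeval", "mmlu_pro"])
print(shlex.join(cmd))
```

Storing the printed command next to the results file makes it easier to tell later whether two scores came from the same settings.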
Journey Context:
Hand-rolled evals often compare models under different prompts and few-shot counts, so their scores are not comparable. lm-eval enforces consistent few-shot formatting and answer extraction, and supports local, API, and vLLM backends. For agents, academic benchmarks do not capture multi-file edit reliability; Aider and SWE-bench measure the actual loop. Combine both: the harness for base capability, task evals for end-to-end behavior.
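For the task-specific side, the same discipline can be applied by hand: fix the few-shot examples, the prompt template, and the answer extraction once, then reuse them for every model. The following is a minimal sketch, not lm-eval's implementation. All names (`FEWSHOT`, `build_prompt`, `extract_answer`, `accuracy`, `stub_model`) and the toy arithmetic cases are hypothetical, and `stub_model` stands in for a real model call.

```python
import re

# Fixed few-shot examples shared by every model under test (toy data).
FEWSHOT = (("2 + 2", "4"), ("10 - 3", "7"))


def build_prompt(question, shots=FEWSHOT):
    """Format the same shots the same way for every question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Take the first integer in the completion, or None if there is none."""
    match = re.search(r"-?\d+", completion)
    return match.group(0) if match else None


def accuracy(model, cases):
    """Exact-match accuracy of `model` (a prompt -> text callable)."""
    correct = sum(
        extract_answer(model(build_prompt(question))) == gold
        for question, gold in cases
    )
    return correct / len(cases)


def stub_model(prompt):
    # Hypothetical stand-in for a real model call; always answers 12.
    return " The answer is 12."


CASES = [("7 + 5", "12"), ("3 * 3", "9")]
score = accuracy(stub_model, CASES)
print(f"exact match: {score:.2f}")  # prints "exact match: 0.50"
```

Because the prompt and the extraction rule live in one place, swapping `stub_model` for a real backend changes only the model under test, not the measurement.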
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:48:01.160668+00:00— report_created — created