Report #2758
[research] What evaluation harness should I use to benchmark a base or chat LLM?
Use EleutherAI's LM Evaluation Harness (`lm-eval`). It is the backend for the Hugging Face Open LLM Leaderboard, implements 60+ standard academic benchmarks (including MMLU, HellaSwag, GSM8K, BBH, and IFEval), and can evaluate both Hugging Face models and OpenAI-compatible APIs. Install it with `pip install lm-eval`, then run `lm-eval --model hf --model_args pretrained= --tasks mmlu,gsm8k,ifeval`, supplying the model's Hub ID or local path after `pretrained=`.
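For scripted or repeated runs, it can help to assemble the command programmatically. The following is a minimal sketch, not part of the harness itself: `build_lm_eval_cmd` is a hypothetical helper, and `EleutherAI/pythia-160m` is only a placeholder model ID. The flags used (`--model`, `--model_args`, `--tasks`, `--batch_size`, `--num_fewshot`, `--output_path`) are standard `lm-eval` CLI options, but check `lm-eval --help` for your installed version.

```python
import shlex
import subprocess
from typing import List, Optional


def build_lm_eval_cmd(
    pretrained: str,
    tasks: List[str],
    num_fewshot: Optional[int] = None,
    output_path: Optional[str] = None,
) -> List[str]:
    """Build an argv list for evaluating a Hugging Face model (hypothetical helper)."""
    cmd = [
        "lm-eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--batch_size", "auto",
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


if __name__ == "__main__":
    # Placeholder model ID; substitute your own Hub ID or local path.
    cmd = build_lm_eval_cmd("EleutherAI/pythia-160m", ["mmlu", "gsm8k", "ifeval"])
    print(shlex.join(cmd))
    # Uncomment to actually launch the run (requires lm-eval to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when a model path contains spaces.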
Journey Context:
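Do not roll your own eval for standard academic benchmarks; the harness handles prompt formatting, few-shot sampling, metrics, and reproducibility. For API models, use `--model local-chat-completions`. Chat APIs that do not return logprobs can only run generative tasks (such as GSM8K or IFEval), because loglikelihood-scored multiple-choice tasks (such as the default MMLU and HellaSwag configurations) need per-token log-probabilities. For coding benchmarks specifically, use the BigCode Evaluation Harness or SWE-bench instead.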
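The logprobs limitation can be handled by filtering the task list before calling a chat endpoint. This is a rough sketch under stated assumptions: `chat_api_cmd` and the `TASK_OUTPUT_TYPE` table are hypothetical and hand-labeled for illustration (verify each task's `output_type` in its YAML config for your installed version), and the model name and URL in the usage lines are placeholders. The `model=` and `base_url=` model arguments and the `--apply_chat_template` flag are assumed to match your `lm-eval` version.

```python
from typing import Dict, List, Tuple

# Hand-labeled for illustration only; confirm against the task configs you have installed.
TASK_OUTPUT_TYPE: Dict[str, str] = {
    "mmlu": "multiple_choice",       # scored by loglikelihood, needs logprobs
    "hellaswag": "multiple_choice",  # scored by loglikelihood, needs logprobs
    "gsm8k": "generate_until",       # generative, works without logprobs
    "ifeval": "generate_until",      # generative, works without logprobs
}


def chat_api_cmd(
    model: str, base_url: str, tasks: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (argv, skipped_tasks) for a chat endpoint that lacks logprobs."""
    runnable = [t for t in tasks if TASK_OUTPUT_TYPE.get(t) == "generate_until"]
    # Unknown tasks are skipped too, since their request type is not known here.
    skipped = [t for t in tasks if t not in runnable]
    cmd = [
        "lm-eval",
        "--model", "local-chat-completions",
        "--model_args", f"model={model},base_url={base_url}",
        "--tasks", ",".join(runnable),
        "--apply_chat_template",
    ]
    return cmd, skipped


if __name__ == "__main__":
    # Placeholder model name and URL for a locally served OpenAI-compatible API.
    cmd, skipped = chat_api_cmd(
        "my-model",
        "http://localhost:8000/v1/chat/completions",
        ["mmlu", "gsm8k", "ifeval"],
    )
    print("command:", " ".join(cmd))
    print("skipped (need logprobs or unknown):", skipped)
```

Tasks that are skipped here can still be run against the same model if it is available as local weights through `--model hf`.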
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others and are not a safety guarantee.
Lifecycle
2026-06-15T13:54:06.233709+00:00— report_created — created