Report #3015
[research] Which benchmark should I use to evaluate a coding LLM or agent?
Use HumanEval+/MBPP+ for fast function-level checks, BigCodeBench for realistic function-level tasks that call multiple libraries, LiveCodeBench for contamination-resistant competitive programming, SWE-bench Verified for resolving real GitHub issues, and Aider's editing benchmark, or an agent scaffold such as SWE-agent, for agentic editing. A robust evaluation combines at least one standalone generation benchmark with at least one repository-level benchmark.
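The "one standalone plus one repository-level" rule can be checked mechanically before a run. The sketch below is only an illustration: the grouping of benchmarks into the two sets is one reading of the recommendation above, not an official taxonomy, and the function name `missing_coverage` is hypothetical.

```python
# Assumed grouping, based on the recommendation above (not an official taxonomy).
STANDALONE = {"HumanEval+", "MBPP+", "BigCodeBench", "LiveCodeBench"}
REPO_LEVEL = {"SWE-bench Verified"}


def missing_coverage(plan: set[str]) -> list[str]:
    """Return the benchmark categories that an eval plan does not cover."""
    gaps = []
    if not plan & STANDALONE:
        gaps.append("standalone generation")
    if not plan & REPO_LEVEL:
        gaps.append("repository-level")
    return gaps


# A HumanEval+-only plan is flagged as lacking a repository-level benchmark.
humaneval_only = missing_coverage({"HumanEval+"})
# Pairing a standalone benchmark with SWE-bench Verified leaves no gaps.
combined = missing_coverage({"LiveCodeBench", "SWE-bench Verified"})
```

An empty list means the plan meets the minimum described above; it says nothing about whether the chosen benchmarks suit a particular model or budget.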
Journey Context:
HumanEval is saturated: strong models score near its ceiling, so it no longer discriminates between them. SWE-bench is the most realistic option here and among the hardest, but it is expensive to run. BigCodeBench and LiveCodeBench sit in the middle and catch different failure modes. Many teams report only HumanEval, which overstates practical ability. Always report pass@1 and the exact harness version so that results can be compared and reproduced.
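For reference, pass@1 is usually computed with the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below assumes per-problem tallies are already available. The sample tallies, the function names, and the field names in `report` are illustrative placeholders, not output from any real harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; results holds (n_samples, n_passed) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical tallies for three problems, 10 samples each.
results = [(10, 10), (10, 5), (10, 0)]
score = benchmark_pass_at_k(results, k=1)

# Record the metric together with the harness version it came from.
report = {
    "benchmark": "HumanEval+",
    "harness_version": "<exact version used>",  # placeholder, fill in per run
    "pass@1": round(score, 4),
}
```

With k = 1 the estimator reduces to c / n per problem, so the example averages 1.0, 0.5, and 0.0.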
Lifecycle
2026-06-15T14:55:04.126683+00:00— report_created — created