Report #3674
[research] Which benchmark should I use to evaluate a coding agent?
HumanEval+/MBPP+ for quick function-level checks; BigCodeBench for diverse library/API calls; LiveCodeBench for contamination-resistant competitive programming; SWE-bench Verified for real bug fixing; Aider Polyglot for multi-language code editing; BFCL (Berkeley Function Calling Leaderboard) for tool use. Match the benchmark to your target task.
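The task-to-benchmark matching above can be written down as a small lookup table. This is only a sketch: the task labels (`bug_fixing`, `tool_use`, and so on) and the `pick_benchmarks` helper are hypothetical names made up for illustration, not part of any benchmark's tooling.

```python
# Hypothetical task labels mapped to the benchmarks named in this report.
# The labels are illustrative assumptions; only the benchmark names come
# from the report itself.
BENCHMARK_FOR_TASK = {
    "function_level": ["HumanEval+", "MBPP+"],
    "library_api_calls": ["BigCodeBench"],
    "competitive_programming": ["LiveCodeBench"],
    "bug_fixing": ["SWE-bench Verified"],
    "multi_language_editing": ["Aider Polyglot"],
    "tool_use": ["BFCL"],
}


def pick_benchmarks(target_tasks):
    """Return the benchmarks for the given task labels, in order, without duplicates."""
    picked = []
    for task in target_tasks:
        if task not in BENCHMARK_FOR_TASK:
            raise KeyError(f"unknown task label: {task!r}")
        for name in BENCHMARK_FOR_TASK[task]:
            if name not in picked:
                picked.append(name)
    return picked


# An agent that mostly fixes bugs and calls tools:
print(pick_benchmarks(["bug_fixing", "tool_use"]))
# prints ['SWE-bench Verified', 'BFCL']
```

Listing several task labels returns several benchmarks, which fits the point that no single score covers everything an agent does.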
Journey Context:
HumanEval is saturated: top models score >95%. SWE-bench is the gold standard for real bug fixing, but it is expensive to run and can be flaky. BigCodeBench tests complex instructions with realistic function calls. LiveCodeBench rotates in problems published after model training cutoffs. Aider Polyglot measures multi-language editing on Exercism exercises. No single number captures agent quality; run the benchmark closest to your production workload.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:54:38.848005+00:00— report_created — created