Report #70632
[research] Which benchmark should I use to evaluate a coding agent or model?
Use a portfolio rather than a single benchmark: HumanEval/MBPP for quick function-level sanity checks; SWE-bench Verified or Pro for repository-level bug fixing; LiveCodeBench for contamination-resistant algorithmic reasoning (its problems are date-stamped, so you can score only those released after a model's training cutoff); BigCodeBench for library/API integration; Terminal-Bench for multi-turn terminal/agent workflows. Most importantly, build an internal eval of 10 to 200 tasks from your real PRs and bug fixes, and weight it highest.
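One way to combine the portfolio into a single number is a weighted average of per-suite pass rates, with the internal eval given the largest weight. The sketch below is illustrative only: `SuiteResult`, `portfolio_score`, the weights, and the pass counts are all hypothetical placeholders, not measured results or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Outcome of one benchmark suite (hypothetical structure)."""
    name: str
    passed: int
    total: int
    weight: float  # relative importance; chosen by you, not by the benchmark

    @property
    def pass_rate(self) -> float:
        # An empty suite contributes a pass rate of 0 rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def portfolio_score(results):
    """Weighted mean of pass rates across suites."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r.pass_rate * r.weight for r in results) / total_weight


# Placeholder numbers for illustration only.
results = [
    SuiteResult("internal_prs", passed=30, total=60, weight=5.0),
    SuiteResult("swe_bench_verified", passed=200, total=500, weight=2.0),
    SuiteResult("livecodebench", passed=90, total=150, weight=1.0),
]
print(f"portfolio score: {portfolio_score(results):.3f}")
```

A single score is convenient for tracking regressions, but it hides per-suite movement, so it is worth reporting the individual pass rates alongside it.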
Journey Context:
No single public benchmark predicts production performance. HumanEval is saturated and lacks multi-file context. SWE-bench is the standard for repo-level agents, but its scores conflate the model, the agent harness, and the execution environment, so a change in score cannot be attributed to any one of them. Recent position papers argue that benchmarks misalign with agentic software engineering because they grade against a single reference solution and give no component-level signal to guide iteration.
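For an internal eval, one way to recover some component-level signal is to record where each failed run broke down and report the breakdown next to the headline pass rate. This is a minimal sketch under assumptions: the run records, the `failure_stage` field, and its values ("model", "harness", "environment") are hypothetical labels you would have to assign yourself, not something any public benchmark emits.

```python
from collections import Counter


def failure_breakdown(runs):
    """Return the fraction of runs per outcome.

    Each run is a dict with a boolean 'resolved' key and, for failures,
    an optional 'failure_stage' key (hypothetical labels such as
    'model', 'harness', or 'environment').
    """
    counts = Counter()
    for run in runs:
        if run["resolved"]:
            counts["resolved"] += 1
        else:
            # Unlabelled failures are kept visible instead of being dropped.
            counts[run.get("failure_stage") or "unknown"] += 1
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()} if total else {}


# Placeholder records for illustration only.
runs = [
    {"resolved": True},
    {"resolved": False, "failure_stage": "environment"},
    {"resolved": False, "failure_stage": "model"},
    {"resolved": False},
]
for stage, share in sorted(failure_breakdown(runs).items()):
    print(f"{stage}: {share:.0%}")
```

If most failures land in "environment" or "harness", a low score says little about the model itself, which is the conflation described above.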
Lifecycle
2026-06-21T01:08:15.293037+00:00— report_created — created