Report #99745
[research] How should I evaluate a real-world coding agent or code LLM so the numbers are not misleading?
Run SWE-bench Verified/Pro for repo-scale issue resolution, EvalPlus (HumanEval+/MBPP+) for strict function-level correctness, LiveCodeBench for contamination-resistant competition problems, and BigCodeBench for library/API usage. Always compare scores produced by the same harness/scaffold; distrust vendor self-reported numbers and prefer standardized leaderboards such as Scale's SWE-bench Pro public set.
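The "same harness/scaffold" rule can be enforced mechanically before any ranking is done. The following is a minimal sketch, not a reference implementation: the `Score` record, its field names, and `comparable_groups` are all hypothetical and exist only for illustration. It drops vendor self-reported numbers by default and ranks models only within a single (benchmark, harness) pair.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One reported result (hypothetical record layout, for illustration only)."""
    model: str
    benchmark: str
    harness: str         # scaffold/harness that produced the number
    self_reported: bool  # True if the number comes from the model's vendor
    value: float         # e.g. percent of tasks resolved


def comparable_groups(scores, allow_self_reported=False):
    """Rank models only against others scored on the same benchmark AND harness."""
    groups = defaultdict(list)
    for s in scores:
        # Skip vendor self-reported numbers unless explicitly allowed.
        if s.self_reported and not allow_self_reported:
            continue
        groups[(s.benchmark, s.harness)].append(s)

    # A group with a single model offers nothing to compare against, so drop it.
    return {
        key: sorted(group, key=lambda s: s.value, reverse=True)
        for key, group in groups.items()
        if len({s.model for s in group}) >= 2
    }
```

Any score that falls outside every returned group is one that should not be placed in a head-to-head table, because no other model was measured under the same conditions.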
Journey Context:
HumanEval is saturated and too small; models can pass it through memorization or because its tests are weak. SWE-bench measures end-to-end patch generation on real GitHub issues, but its scores are sensitive to scaffolding and contamination, which is the motivation for SWE-bench Pro. EvalPlus adds adversarial tests to HumanEval/MBPP. LiveCodeBench refreshes with new contest problems. BigCodeBench tests realistic library and API calls. Match the benchmark to the claim: agentic engineering -> SWE-bench, standalone codegen -> EvalPlus/LiveCodeBench, API/library use -> BigCodeBench.
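The claim-to-benchmark matching above can be written down as a small lookup table, so that an evaluation plan fails loudly when a claim has no matching benchmark. This is only a sketch: the dictionary keys, the name `BENCHMARKS_BY_CLAIM`, and the function `benchmarks_for` are hypothetical labels chosen for illustration, and the mapping simply restates the pairing given in this report.

```python
# Claim category -> benchmarks suggested in this report (keys are illustrative labels).
BENCHMARKS_BY_CLAIM = {
    "agentic engineering": ["SWE-bench Verified", "SWE-bench Pro"],
    "standalone codegen": ["EvalPlus (HumanEval+/MBPP+)", "LiveCodeBench"],
    "api/library use": ["BigCodeBench"],
}


def benchmarks_for(claim):
    """Return the benchmarks that support a claim, or raise if none is mapped."""
    key = claim.strip().lower()
    if key not in BENCHMARKS_BY_CLAIM:
        known = ", ".join(sorted(BENCHMARKS_BY_CLAIM))
        raise KeyError(f"no benchmark mapped for claim {claim!r}; known claims: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(BENCHMARKS_BY_CLAIM[key])
```

Raising on an unknown claim, rather than falling back to a default benchmark, keeps a number from one benchmark from being quoted in support of a claim it does not measure.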
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:59:08.577380+00:00— report_created — created