Report #308
[research] What benchmark should I use to measure whether my coding agent actually improves?
Use SWE-bench Verified or SWE-bench Lite for end-to-end real-world bug fixing; Aider's code-editing benchmark for editor-style changes to existing code driven by natural-language instructions; BigCodeBench for diverse programming tasks; and HumanEval/MBPP only for quick isolated-function sanity checks. Never optimize solely for HumanEval: it is saturated, so gains there say little about repo-level performance.
Journey Context:
HumanEval was groundbreaking but is now saturated; top models score above 90% on its small, isolated functions. MBPP is slightly harder but still synthetic. SWE-bench is widely treated as the gold standard because it draws on real GitHub issues and checks each patch against tests in the real codebase, so it exercises environment setup, test running, and multi-file reasoning. SWE-bench Verified is a cleaner, human-validated subset whose tasks have been confirmed solvable. Aider's benchmark measures the specific 'edit existing code from natural-language instructions' workflow. BigCodeBench covers a broader task distribution. The trap is choosing the benchmark that flatters your model; instead, pick the benchmark closest to your deployment task and report confidence intervals, because these evals have high variance: the task sets are small and agent runs are nondeterministic.
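One way to act on the confidence-interval advice is sketched below, using only the Python standard library. It assumes each benchmark task yields a single pass/fail outcome (1 or 0). The function names `wilson_interval` and `paired_bootstrap_diff` are hypothetical helpers written for this illustration, not part of any benchmark's tooling, and the counts in the usage section are made up. The first function puts an interval on a single pass rate; the second compares two agent versions on the same tasks, which is usually the more relevant question when asking whether an agent improved.

```python
import math
import random


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def paired_bootstrap_diff(
    baseline: list[int],
    candidate: list[int],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> tuple[float, float]:
    """95% percentile-bootstrap interval for mean(candidate - baseline).

    Both lists hold 0/1 outcomes for the same tasks in the same order,
    so resampling tasks keeps each pair of outcomes together.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty result lists of equal length")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(wilson_interval(passed=250, total=500))

    old_agent = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
    new_agent = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
    print(paired_bootstrap_diff(old_agent, new_agent))
```

If the interval for the difference includes zero, the data do not yet show an improvement. Repeating runs with different seeds and pooling the per-task outcomes is one way to narrow it.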
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T03:41:36.055442+00:00— report_created — created