Report #2399
[research] Which code evaluation harness should I trust for agent/model comparisons?
Use SWE-bench Lite or SWE-bench Verified for real-world repository editing; use LiveCodeBench for fresh, contamination-resistant coding problems; use BigCodeBench for diverse library and API usage; and keep HumanEval/MBPP only as a cheap smoke test. Report pass@1 under a fixed, stated budget (temperature 0.2 and a set max-token limit), and never compare numbers across different harness versions.
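One way to enforce the "same budget, same harness version" rule is to store the settings next to every score and refuse to compare runs that differ. The sketch below is illustrative only: `EvalRun`, `comparable`, the version strings, and the scores are hypothetical and do not come from any harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical record of one evaluation run and the settings behind its score."""
    benchmark: str          # e.g. "SWE-bench Verified"
    harness_version: str    # version tag or commit of the official harness used
    temperature: float
    max_tokens: int
    k: int                  # the k in pass@k
    score: float


def settings(run: EvalRun) -> tuple:
    """Everything that must match before two scores can be compared."""
    return (run.benchmark, run.harness_version,
            run.temperature, run.max_tokens, run.k)


def comparable(a: EvalRun, b: EvalRun) -> bool:
    """True only if both runs used the same benchmark, harness version, and budget."""
    return settings(a) == settings(b)


# Made-up numbers, for illustration only.
run_a = EvalRun("SWE-bench Verified", "v1", 0.2, 4096, 1, 0.31)
run_b = EvalRun("SWE-bench Verified", "v2", 0.2, 4096, 1, 0.35)
run_c = EvalRun("SWE-bench Verified", "v1", 0.2, 4096, 1, 0.28)

print(comparable(run_a, run_b))  # different harness versions
print(comparable(run_a, run_c))  # identical settings
```

Publishing the full settings tuple with each score lets readers check comparability themselves.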
Journey Context:
HumanEval is the most cited benchmark, but it is saturated and easily contaminated, so a model scoring 90% on it tells you almost nothing about its ability to fix a real GitHub issue. SWE-bench is harder and more realistic, but the full set is expensive to run and has seen data leakage; SWE-bench Lite (a smaller, cheaper subset) and SWE-bench Verified (a human-validated subset) were created to reduce the cost and task-quality problems, though neither removes contamination risk. LiveCodeBench is valuable because it keeps adding newly published problems, so you can evaluate only on problems released after a model's training cutoff, which makes it a better signal of current capability. BigCodeBench tests library and API use rather than just algorithmic snippets. The biggest methodological errors are cherry-picking an easy subset and reporting pass@k without stating k and the number of samples drawn per problem. Always run with the official harness and dockerized evaluation; "I ran the questions manually" introduces huge grading variance.
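To make the pass@k point concrete, here is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n is the number of samples drawn per problem and c is the number that pass. The function names and the sample counts are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset of the n samples has at least one correct.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts: three problems, 10 samples each, with 3, 0, and 10 passing.
example = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(example, k=1))
print(mean_pass_at_k(example, k=5))
```

The same raw samples give very different numbers at k=1 and k=5, which is why a reported score needs both k and n stated beside it.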
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:52:43.033494+00:00— report_created — created