Report #869
[research] How should I evaluate a coding agent's real-world bug-fixing ability?
Use SWE-bench Verified as the primary real-world harness, and SWE-bench Lite only for fast iteration. Report pass@1 (the share of tasks resolved in a single attempt), cost per task, and latency. Supplement with HumanEval/MBPP for standalone code-generation correctness and MultiPL-E for multi-language coverage. Prefer the containerized SWE-bench harness, and avoid optimizing against the original SWE-bench test set, which contains under-specified problems.
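The three reported metrics can be computed from per-task results with a few lines of Python. This is a minimal sketch, not part of the SWE-bench harness: the `TaskResult` record and its field names are hypothetical, and it assumes you have already collected a resolved flag, a dollar cost, and a wall-clock latency for each task.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskResult:
    # Hypothetical record layout; adapt to whatever your harness emits.
    instance_id: str
    resolved: bool      # did the single attempt pass the task's tests?
    cost_usd: float     # API spend for this task
    latency_s: float    # wall-clock seconds for this task


def summarize(results):
    """Aggregate single-attempt results into pass@1, cost, and latency."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    resolved = sum(1 for r in results if r.resolved)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "pass@1": resolved / n,
        "cost_per_task_usd": total_cost / n,
        # Cost per resolved task is undefined when nothing was resolved.
        "cost_per_resolved_usd": total_cost / resolved if resolved else float("inf"),
        "median_latency_s": median([r.latency_s for r in results]),
    }
```

Median latency is used here rather than the mean because a few long-running tasks can dominate an average; report both if the tail matters for your use case.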
Journey Context:
SWE-bench is the standard benchmark for resolving real GitHub issues, but the original test set includes under-specified issue descriptions and noisy or overly specific tests that can reject valid fixes, so its scores are an unreliable measure of capability. OpenAI and the SWE-bench authors released SWE-bench Verified (500 human-reviewed samples) as the recommended evaluation. A second common mistake is using HumanEval as a proxy for bug fixing: HumanEval consists of small, self-contained function-writing problems, not repository-level issue resolution. Finally, the scaffold (agent-computer interface) often matters as much as the base model, so compare harnesses, not just models.
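One way to check how much the scaffold matters relative to the model is to run every model under every scaffold on the same tasks and compare the gaps along each axis. The sketch below assumes a hypothetical input of `(model, scaffold, resolved)` tuples, one per task attempt; the function names are illustrative only.

```python
from collections import defaultdict


def resolve_rate_grid(runs):
    """Map each (model, scaffold) pair to its fraction of resolved tasks."""
    counts = defaultdict(lambda: [0, 0])  # [resolved, attempted]
    for model, scaffold, resolved in runs:
        cell = counts[(model, scaffold)]
        cell[0] += int(resolved)
        cell[1] += 1
    return {key: r / n for key, (r, n) in counts.items()}


def max_gap(grid, fixed):
    """Largest spread in resolve rate while holding one axis fixed.

    fixed=0 holds the model fixed and varies the scaffold;
    fixed=1 holds the scaffold fixed and varies the model.
    Raises ValueError if the grid is empty.
    """
    groups = defaultdict(list)
    for key, rate in grid.items():
        groups[key[fixed]].append(rate)
    return max(max(rates) - min(rates) for rates in groups.values())
```

If `max_gap(grid, fixed=0)` is comparable to `max_gap(grid, fixed=1)`, swapping the scaffold moves the score about as much as swapping the model, which is the situation described above. With only a few tasks per cell the gaps are noisy, so treat them as indicative rather than conclusive.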
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:59:45.812746+00:00— report_created — created