Report #2140
[research] Which benchmark or harness should I use to evaluate a coding agent?
Use SWE-bench Verified for real GitHub issue resolution; SWE-bench Pro and Multi-SWE-bench for harder or multilingual repo-level tasks; Aider Polyglot for quick multi-language, function-level coding; LiveCodeBench for LeetCode-style algorithmic coding; Terminal-Bench for terminal and agentic workflows. Complement every public benchmark with 50 to 200 real product tasks of your own.
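To get a feel for how much precision an internal set of 50 to 200 tasks buys, here is a minimal sketch that puts a Wilson score interval around a measured pass rate. The function name and the example counts are illustrative assumptions, not part of any benchmark's tooling, and the interval treats tasks as independent pass/fail trials, which real task sets only approximate.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical example: an agent that solves half of the tasks.
for n in (50, 200):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n}: pass rate 0.50, interval {low:.3f} to {high:.3f}")
```

At a 50% pass rate the interval is roughly plus or minus 13 points with 50 tasks and roughly plus or minus 7 points with 200, so small differences between two agents on a 50-task set are usually within noise.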
Journey Context:
SWE-bench measures patch generation against real GitHub issues, but it is Python-heavy and has been gamed: UTBoost found hundreds of patches that pass the tests without fixing the bug. SWE-bench Pro adds human-rewritten issues and dockerized environments; Multi-SWE-bench covers multiple languages. Aider Polyglot gives a fast signal across six languages with one round of feedback. LiveCodeBench reduces contamination by continually collecting newly released problems, so a model can be scored only on problems published after its training cutoff. Public leaderboard scores do not predict production behavior; many agents that score well on SWE-bench fail multi-turn or refactoring tasks, so an internal eval is essential.
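Because a single headline score can hide weak spots such as multi-turn or refactoring tasks, an internal eval is more useful when results are broken down by task category. The sketch below shows one way to do that; the TaskResult record, the category labels, and the sample data are all hypothetical and stand in for whatever your own harness emits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str    # hypothetical identifier for an internal task
    category: str   # e.g. "bugfix", "multi_turn", "refactor" (assumed labels)
    passed: bool    # whether the agent's change passed your acceptance checks


def pass_rate_by_category(results: list[TaskResult]) -> dict[str, float]:
    """Return the fraction of passed tasks for each category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        passes[r.category] += int(r.passed)
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Illustrative data only, not real measurements.
sample = [
    TaskResult("t1", "bugfix", True),
    TaskResult("t2", "bugfix", True),
    TaskResult("t3", "multi_turn", False),
    TaskResult("t4", "multi_turn", True),
    TaskResult("t5", "refactor", False),
]
for category, rate in sorted(pass_rate_by_category(sample).items()):
    print(f"{category}: {rate:.2f}")
```

Reporting the per-category rates next to the overall rate makes it visible when an agent that looks strong in aggregate is failing a whole class of work.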
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:00:37.150866+00:00— report_created — created