Report #576
[research] What benchmark should I use to evaluate a coding agent or coding LLM?
Use SWE-bench Verified for real-world bug-fixing on GitHub issues, LiveCodeBench for contamination-resistant competitive-programming tasks, and Aider Polyglot for code editing across multiple languages. Do not rely on HumanEval or MBPP alone; they are saturated and unrepresentative of production engineering. Start with these headline benchmarks, then evaluate on a held-out set of your own tasks, because leaderboard scores can reward benchmark-specific shortcuts (for example, patches that pass the tests but are semantically wrong) and do not transfer cleanly to your codebase.
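As a rough illustration of the held-out evaluation step, the sketch below aggregates per-task outcomes into a resolve rate per model and lists the tasks where two models disagree. It assumes you have already run each model on your own tasks and recorded a pass/fail result. The names `TaskResult`, `resolve_rates`, and `disagreements` are hypothetical and do not come from any benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str    # hypothetical identifier for one of your own tickets
    model: str      # hypothetical label for the model under test
    resolved: bool  # True if the model's patch passed your held-out tests


def resolve_rates(results):
    """Return {model: fraction of tasks resolved} for a list of TaskResult."""
    totals, passed = {}, {}
    for r in results:
        totals[r.model] = totals.get(r.model, 0) + 1
        passed[r.model] = passed.get(r.model, 0) + int(r.resolved)
    return {m: passed[m] / totals[m] for m in totals}


def disagreements(results, model_a, model_b):
    """Return sorted task ids where exactly one of the two models succeeded."""
    by_task = {}
    for r in results:
        by_task.setdefault(r.task_id, {})[r.model] = r.resolved
    return sorted(
        task_id
        for task_id, outcome in by_task.items()
        if model_a in outcome
        and model_b in outcome
        and outcome[model_a] != outcome[model_b]
    )
```

The disagreement list is often more informative than the headline rate: reading those patches by hand is one way to catch solutions that pass tests without being correct.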
Journey Context:
HumanEval was useful in 2021 but is now near-ceiling for frontier models, so it no longer discriminates between them. SWE-bench Verified addresses this with real GitHub issues drawn from a human-validated subset of SWE-bench, in which reviewers checked each task's description and tests, making it one of the closest public proxies for actual software engineering. LiveCodeBench continuously pulls fresh problems from sites such as LeetCode and Codeforces, so it is harder to game via training-data contamination. Aider Polyglot measures assistant-style editing (search/replace blocks, whole files, unified diffs) across several languages. Recent analyses note that even SWE-bench can be gamed: some top leaderboard entries pass tests with semantically wrong patches. That is why a custom evaluation on your own repositories and ticket distribution is non-negotiable before choosing a model for production agent work.
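The contamination argument can be applied to any problem set you control. The sketch below is a minimal example and is not LiveCodeBench's own code. It assumes each problem record carries a release date and that you know, or can estimate, the model's training cutoff. The field name `released` and the function `post_cutoff` are hypothetical.

```python
from datetime import date


def post_cutoff(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff.

    `problems` is a list of dicts with a hypothetical `released` field
    holding a datetime.date; problems on or before the cutoff are dropped
    because the model may have seen them during training.
    """
    return [p for p in problems if p["released"] > training_cutoff]


# Illustrative data only; ids and dates are made up for the example.
problems = [
    {"id": "p1", "released": date(2023, 5, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
    {"id": "p3", "released": date(2025, 2, 3)},
]
fresh = post_cutoff(problems, training_cutoff=date(2024, 6, 30))
fresh_ids = [p["id"] for p in fresh]
```

Comparing a model's score on the filtered set against its score on the full set gives a rough signal of how much of its performance may come from memorized problems.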
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:55:25.021094+00:00— report_created — created