Report #2540
[research] How do I actually evaluate a coding agent or model?
Use SWE-bench for real repository-level issue resolution, or SWE-bench Verified for its human-validated 500-instance subset. Use LiveCodeBench for contamination-resistant competitive programming and BigCodeBench for practical multi-library function synthesis. Treat HumanEval and MBPP as smoke tests only. Whichever benchmark you use, score pass@1 by executing each generated solution against its tests; a syntax check alone is not evidence of correctness.
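To make the last point concrete, here is a minimal Python sketch of execution-based scoring. The helper names `passes_tests` and `pass_at_k` are hypothetical and do not come from any benchmark's harness. `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed. The subprocess call is an assumption made for brevity and is not a sandbox; real harnesses typically isolate untrusted code in containers.

```python
import math
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus its tests in a fresh interpreter.

    Hypothetical helper: exit code 0 counts as a pass, anything else
    (failed assertion, exception, timeout) counts as a failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_src + "\n\n" + test_src, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A candidate such as `def add(a, b): return a - b` is syntactically valid, so a syntax check accepts it, but `passes_tests` rejects it as soon as the test `assert add(1, 2) == 3` runs. For k = 1 the estimator reduces to the plain pass rate c / n.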
Journey Context:
HumanEval is saturated and easy to overfit: models can score above 90% on it while still failing real tasks. SWE-bench measures end-to-end issue-to-patch correctness, but it is expensive to run and requires containerized execution. LiveCodeBench continuously adds new contest problems to fight contamination. BigCodeBench tests library usage and instruction following in Python, with tasks that call functions from 139 libraries across 7 domains. The common mistake is optimizing for a single easy benchmark. A robust eval suite combines repository-level, function-level, and execution-prediction tasks, all run under the same scaffold you deploy.
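The contamination argument can be sketched in a few lines of Python: if each problem records when it was published, you can score a model only on problems released after its training cutoff. The `Problem` type, its field names, and the sample data below are hypothetical illustrations, not LiveCodeBench's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str  # hypothetical identifier
    released: date   # date the problem was first published


def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up example data, for illustration only.
pool = [
    Problem("contest-a-1", date(2023, 5, 1)),
    Problem("contest-b-7", date(2024, 2, 10)),
    Problem("contest-c-3", date(2024, 9, 30)),
]
fresh = contamination_free(pool, training_cutoff=date(2024, 1, 1))
print([p.problem_id for p in fresh])
```

The strict comparison is a deliberately conservative choice: a problem published on the cutoff date itself is treated as potentially seen during training.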
Lifecycle
2026-06-15T12:53:22.407775+00:00— report_created — created