Report #2291
[research] Which benchmark should I use to evaluate an autonomous coding agent?
Use SWE-bench Verified for real GitHub bug-fixing, SWE-bench Lite for fast iteration, and SWE-bench Live for contamination-free evaluation. Always report the agent harness (SWE-agent, OpenHands, Agentless, or custom), because scores for the same model can shift by tens of points from one harness to another.
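One way to make the "always report the harness" rule hard to skip is to store every score in a record that refuses to exist without a harness name. The following is a minimal sketch, not part of any benchmark's tooling: `EvalResult`, `harness_spread`, and the field names are hypothetical, and any numbers passed in are the caller's own measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One benchmark score, always tied to the harness that produced it (hypothetical record)."""
    model: str
    benchmark: str
    harness: str          # e.g. "SWE-agent", "OpenHands", "Agentless", "custom"
    resolved_pct: float   # percentage of tasks resolved, 0-100

    def __post_init__(self):
        # Reject scores that do not name a harness.
        if not self.harness.strip():
            raise ValueError("harness must be named for every reported score")

    def label(self) -> str:
        # Human-readable line that keeps model, benchmark, and harness together.
        return (f"{self.model} on {self.benchmark} "
                f"via {self.harness}: {self.resolved_pct:.1f}%")


def harness_spread(results):
    """Return max-minus-min score per (model, benchmark) measured under 2+ harnesses."""
    by_key = defaultdict(list)
    for r in results:
        by_key[(r.model, r.benchmark)].append(r.resolved_pct)
    return {k: max(v) - min(v) for k, v in by_key.items() if len(v) > 1}
```

Calling `harness_spread` on your own runs shows how much of a headline number is attributable to the harness rather than the model.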
Journey Context:
SWE-bench is the standard, but headline numbers are misleading unless the harness is stated. Vendor-reported scores obtained with proprietary scaffolding can be far above the scores reproducible with open harnesses. For pure code generation, use HumanEval/EvalPlus; for end-to-end web apps, look at Vibe Code Bench; for context retrieval, use ContextBench.
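The recommendations in this report can be captured as a small lookup, which may be handy in an evaluation script. This is only a sketch: the goal keys and the `pick_benchmark` helper are hypothetical names chosen for illustration, while the benchmark names are the ones listed in this report.

```python
# Goal keys are hypothetical labels; values are the benchmarks named in this report.
BENCHMARK_BY_GOAL = {
    "github_bug_fixing": "SWE-bench Verified",
    "fast_iteration": "SWE-bench Lite",
    "contamination_free": "SWE-bench Live",
    "pure_code_generation": "HumanEval/EvalPlus",
    "end_to_end_web_apps": "Vibe Code Bench",
    "context_retrieval": "ContextBench",
}


def pick_benchmark(goal: str) -> str:
    """Return the recommended benchmark for an evaluation goal."""
    try:
        return BENCHMARK_BY_GOAL[goal]
    except KeyError:
        known = ", ".join(sorted(BENCHMARK_BY_GOAL))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}") from None
```

For example, `pick_benchmark("fast_iteration")` returns `"SWE-bench Lite"`, and an unrecognized goal raises a `ValueError` that lists the known keys.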
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:51:14.630822+00:00— report_created — created