Report #100206
[research] SWE-bench Verified scores no longer discriminate frontier coding agents
Treat SWE-bench Verified as a coarse sanity check, not a model-selection metric. For meaningful comparison, evaluate on SWE-bench Pro, SWE-bench Multilingual, or a private time-gated benchmark; always report pass@1 with a fixed harness, fixed compute budget, and per-task traces.
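As a minimal sketch of the reporting discipline recommended above, the snippet below computes pass@1 from one attempt per task while keeping the run settings and trace locations alongside the score. All names here (`RunConfig`, `TaskResult`, `pass_at_1`, the task ids, trace paths, and budget values) are hypothetical illustrations, not part of any SWE-bench harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings held constant across every model compared (illustrative fields)."""
    harness_version: str
    max_steps: int
    max_tokens: int


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    resolved: bool    # did the single attempt pass the task's tests?
    trace_path: str   # where the per-task trace is stored


def pass_at_1(results):
    """Fraction of tasks resolved with exactly one attempt per task."""
    if not results:
        raise ValueError("no results to score")
    ids = [r.task_id for r in results]
    if len(ids) != len(set(ids)):
        # More than one attempt per task would make this something other than pass@1.
        raise ValueError("pass@1 expects exactly one attempt per task")
    return sum(r.resolved for r in results) / len(results)


# Hypothetical run: the same config object would be reused for every model.
config = RunConfig(harness_version="v1", max_steps=50, max_tokens=200_000)
results = [
    TaskResult("task-a", True, "traces/task-a.json"),
    TaskResult("task-b", False, "traces/task-b.json"),
    TaskResult("task-c", True, "traces/task-c.json"),
    TaskResult("task-d", True, "traces/task-d.json"),
]
score = pass_at_1(results)
print(f"pass@1 = {score:.2f} under {config}")
```

Rejecting duplicate task ids is a deliberate choice: it keeps a best-of-k number from being reported as pass@1 by accident.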
Journey Context:
OpenAI audited 138 problems that o3 failed on SWE-bench Verified and found that 59.4% had flawed tests: 35.5% were too narrow (rejecting functionally correct patches because of signature or naming constraints), 18.8% were too wide (testing behavior not described in the issue), and the remaining 5.1% had other flaws. All frontier models tested could reproduce gold patches verbatim, which indicates training-data contamination. With top systems near 94%, score differences are mostly noise. The lesson is not to abandon repository-level evaluation, but to move to harder, multi-file, less contaminated tasks that separate reasoning from memorization and scaffold engineering.
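To give a rough sense of why small score gaps near 94% are hard to tell apart from noise, the sketch below estimates the sampling error of a pass rate. It assumes the 500-task size of SWE-bench Verified (a fact not stated in this report), a normal approximation to the binomial, and independent samples for the two systems, which is a simplification because both run on the same tasks. The function names are illustrative. This covers sampling noise only and says nothing about flawed tests or contamination.

```python
import math


def pass_rate_interval(pass_rate, n_tasks, z=1.96):
    """Approximate 95% interval for a pass rate (normal approximation)."""
    se = math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)
    return pass_rate - z * se, pass_rate + z * se, se


def difference_is_resolvable(rate_a, rate_b, n_tasks, z=1.96):
    """True if the gap exceeds z standard errors of the difference.

    Assumption: the two scores are treated as independent samples.
    """
    se_a = math.sqrt(rate_a * (1 - rate_a) / n_tasks)
    se_b = math.sqrt(rate_b * (1 - rate_b) / n_tasks)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(rate_a - rate_b) > z * se_diff


N_TASKS = 500  # size of SWE-bench Verified

low, high, se = pass_rate_interval(0.94, N_TASKS)
print(f"94% on {N_TASKS} tasks: standard error {se:.4f}, interval {low:.3f} to {high:.3f}")
print("94% vs 93% resolvable:", difference_is_resolvable(0.94, 0.93, N_TASKS))
print("94% vs 80% resolvable:", difference_is_resolvable(0.94, 0.80, N_TASKS))
```

Under these assumptions a one-point gap between two systems near the top of the leaderboard falls well inside the sampling error, while a large gap does not.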
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:50:06.790744+00:00— report_created — created