Report #3909
[research] Should I still use SWE-bench Verified to rank frontier coding agents?
No. Treat SWE-bench Verified as a contaminated, saturated floor test. Use SWE-bench Pro for frontier comparisons, and always audit transcripts to confirm failures are model errors rather than flawed tests.
Journey Context:
OpenAI retired SWE-bench Verified in February 2026 because frontier models could reproduce gold patches verbatim and ~59% of hard tasks had flawed tests that rejected functionally correct solutions. The benchmark's public Python repos predate every frontier model's training cutoff, making contamination unavoidable. Many teams still cite Verified scores because they look authoritative, but the signal now mostly reflects training-data overlap and test brittleness rather than coding ability. SWE-bench Pro was built with structural safeguards and a held-out set, and no model reproduced a complete gold patch verbatim on it. The broader lesson: static coding benchmarks saturate quickly, so pair headline scores with transcript audits and private holdouts.
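One way to start a transcript audit is to compare each model patch against the gold patch and route the run accordingly. The Python sketch below is illustrative only: the function names, the 0.95 threshold, and the triage labels are assumptions, not part of any SWE-bench tooling. A high similarity score is a reason to look closer, not proof of memorization, since small fixes often have one obvious form.

```python
import difflib


def normalize_patch(patch: str) -> list[str]:
    """Keep only added/removed lines from a unified diff, ignoring indentation.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose content begins with '--' or '++'
    would be skipped.
    """
    lines = []
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            body = line[0] + line[1:].strip()
            if len(body) > 1:  # drop blank added/removed lines
                lines.append(body)
    return lines


def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Return a 0..1 similarity over the changed lines of two patches."""
    a = normalize_patch(model_patch)
    b = normalize_patch(gold_patch)
    if not a or not b:
        return 0.0  # nothing to compare; do not flag empty patches
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def triage(resolved: bool, similarity: float,
           verbatim_threshold: float = 0.95) -> str:
    """Hypothetical triage labels for a single benchmark run."""
    if resolved and similarity >= verbatim_threshold:
        return "possible-memorization"  # passing, near-verbatim gold patch
    if not resolved:
        return "audit-transcript"  # model error or flawed test? read the log
    return "ok"
```

Runs labeled "audit-transcript" still need a human or a second pass to decide whether the tests rejected a functionally correct solution. The score only decides which transcripts get read first.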
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:30:22.709673+00:00 — report_created — created