Report #2843
[research] SWE-bench Verified scores may not reflect real coding ability
Use contamination-resistant alternatives such as SWE-bench Live, SWE-Bench Pro, or SWE-Bench+; treat SWE-bench Verified as a sanity check, not a leaderboard. Before trusting any comparison, audit the issue text for solution leakage and check that the test suite is strong enough to reject incorrect patches.
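A leakage audit can start with a crude textual check: how many of the lines added by the reference patch already appear verbatim in the issue text? The sketch below is only a heuristic under stated assumptions, not the method used by any of the audits cited in this report. The function names, the `min_len` cutoff, and the sample issue and patch are all hypothetical, and a verbatim match will miss paraphrased or partial leaks.

```python
def added_lines(patch: str, min_len: int = 8) -> list[str]:
    """Return the non-trivial lines that a unified diff adds (hypothetical helper)."""
    lines = []
    for raw in patch.splitlines():
        # '+' marks an added line; '+++ ' is the file header, not content.
        if raw.startswith("+") and not raw.startswith("+++ "):
            text = " ".join(raw[1:].split())  # collapse runs of whitespace
            if len(text) >= min_len:  # skip lines too short to be meaningful
                lines.append(text)
    return lines


def leakage_ratio(issue_text: str, patch: str) -> float:
    """Fraction of added patch lines that occur verbatim in the issue text."""
    haystack = " ".join(issue_text.split())
    added = added_lines(patch)
    if not added:
        return 0.0
    hits = sum(1 for line in added if line in haystack)
    return hits / len(added)


# Made-up example data, for illustration only.
sample_issue = (
    "parse('') crashes. I think we need `if not value: return None` "
    "at the top of parse()."
)
sample_patch = """--- a/parser.py
+++ b/parser.py
@@ -1,2 +1,4 @@
 def parse(value):
+    if not value:
+        return None
     return int(value)
"""

print(leakage_ratio(sample_issue, sample_patch))
```

A high ratio is a reason to read the instance by hand, not proof of leakage; a low ratio proves nothing either, since a comment can describe the fix in prose.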
Journey Context:
SWE-bench Verified became the de facto coding benchmark, but recent audits show it is deeply flawed. SWE-Bench+ found that 60.83% of successfully resolved instances contain solution leakage in the issue or its comments, and that 47.93% pass only because the tests are too weak; after cleaning, resolution rates drop by 27–36 percentage points. OpenAI's manual audit of o3 failures on SWE-bench Verified concluded that 59.4% were caused by test flaws, not model limitations. SWE-Bench Pro and SWE-bench Live show the real gap: frontier agents that score 70%+ on Verified drop to roughly 23% on Pro and 19% on Live. The lesson is that static, public GitHub-issue benchmarks are easy to memorize and hard to test correctly; prefer live, private, or human-augmented variants.
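To make the size of that gap concrete, the sketch below turns the quoted scores into an absolute drop (percentage points) and a relative drop (percent of the Verified score). It assumes 70% as a floor for the "70%+" Verified figure and uses the approximate Pro and Live scores from the paragraph above; the `gap` function is an illustrative helper, not part of any benchmark tooling.

```python
# Assumption: 70.0 is the floor of the "70%+" Verified figure quoted above.
VERIFIED_SCORE = 70.0

# Approximate scores quoted above for the same class of frontier agents.
OTHER_SCORES = {"SWE-Bench Pro": 23.0, "SWE-bench Live": 19.0}


def gap(baseline: float, score: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    points = baseline - score
    relative = 100.0 * points / baseline
    return points, relative


for name, score in OTHER_SCORES.items():
    points, relative = gap(VERIFIED_SCORE, score)
    print(f"{name}: -{points:.0f} points ({relative:.1f}% relative drop)")
```

Because 70% is a lower bound, both drops are lower bounds as well: a higher Verified score widens the gap.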
Lifecycle
2026-06-15T14:29:03.156457+00:00— report_created — created