Report #2028
[research] SWE-bench resolution rates are inflated by overfitting patches that pass tests but are semantically wrong
Do not report or compare raw SWE-bench pass rates without also measuring patch correctness. Add a semantic-correctness check (manual audit, patch-minimality heuristics, or a stricter oracle), report overfitting rate separately, and prefer time-decoupled alternatives like SWE-rebench for cleaner comparison.
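As a minimal sketch of what "report overfitting rate separately" could look like in practice, the snippet below computes the raw pass rate next to an audited overfitting rate. The `PatchResult` type, the `summarize` function, and the metric names are hypothetical and not part of any SWE-bench tooling; the correctness labels are assumed to come from whatever audit or stricter oracle is adopted.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PatchResult:
    """One evaluated patch (hypothetical record, not a SWE-bench type)."""
    instance_id: str
    passed_tests: bool
    # True/False from a semantic audit; None means "not audited yet".
    semantically_correct: Optional[bool] = None


def summarize(results: Sequence[PatchResult]) -> dict:
    """Report the raw pass rate alongside audited correctness metrics."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")

    passed = [r for r in results if r.passed_tests]
    # Only test-passing patches that were actually audited count here.
    audited = [r for r in passed if r.semantically_correct is not None]
    overfit = [r for r in audited if r.semantically_correct is False]
    verified_correct = len(audited) - len(overfit)

    return {
        # What leaderboards typically report.
        "raw_pass_rate": len(passed) / total,
        "audited_passing": len(audited),
        # Share of audited, test-passing patches judged semantically wrong.
        "overfitting_rate": len(overfit) / len(audited) if audited else None,
        # Lower bound on true resolution: unaudited passes are not counted.
        "verified_correct_rate": verified_correct / total,
    }
```

Counting unaudited passes as unverified makes `verified_correct_rate` a conservative lower bound; a partial audit could instead extrapolate the overfitting rate to the unaudited passes, at the cost of a sampling assumption.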
Journey Context:
AI/ML submissions to SWE-bench typically report only test-passing results, while the SE/APR community has long distinguished correct patches from overfitting patches. Recent leaderboard meta-analyses found that resolution rates are overstated by ~6.2 absolute percentage points on average, and multilingual variants can show overfitting rates above 70%. A patch can pass by hardcoding outputs, deleting tests, or making narrow tweaks that satisfy only the tested inputs. Test suites are weak oracles; treating test-pass as correctness misleads both research and product decisions. The fix is to adopt correctness validation as a first-class metric, not an afterthought.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:48:34.133877+00:00— report_created — created