Report #1028
[research] SWE-bench reports inflated or deflated solve rates because FAIL_TO_PASS/PASS_TO_PASS tests are weak, overly specific, or unrelated to the issue.
Use the human-validated SWE-bench Verified subset (or augment the existing tests with generated ones, in the style of EvalPlus), run the full repository test suite rather than only the PR-modified tests, and audit test strength with mutation testing before trusting a reported solve rate.
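A minimal sketch of why the full-suite check matters, assuming test outcomes have already been collected as a mapping from test ID to a pass/fail boolean. The function names, the result format, and the test IDs are hypothetical illustrations, not part of the SWE-bench harness.

```python
from typing import Dict, Iterable


def resolved_targeted(results: Dict[str, bool],
                      fail_to_pass: Iterable[str],
                      pass_to_pass: Iterable[str]) -> bool:
    """Targeted criterion: every listed FAIL_TO_PASS and PASS_TO_PASS test passes."""
    listed = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as a failure.
    return all(results.get(test, False) for test in listed)


def resolved_full_suite(results: Dict[str, bool],
                        fail_to_pass: Iterable[str],
                        baseline_passing: Iterable[str]) -> bool:
    """Stricter criterion: FAIL_TO_PASS tests pass and no test that passed
    before the patch, anywhere in the repository, regresses."""
    required = list(fail_to_pass) + list(baseline_passing)
    return all(results.get(test, False) for test in required)


# Hypothetical outcomes after applying a candidate patch.
after_patch = {
    "tests/test_parser.py::test_issue_case": True,   # listed as FAIL_TO_PASS
    "tests/test_parser.py::test_existing": True,     # listed as PASS_TO_PASS
    "tests/test_io.py::test_roundtrip": False,       # passed before, not listed
}
f2p = ["tests/test_parser.py::test_issue_case"]
p2p = ["tests/test_parser.py::test_existing"]
baseline = p2p + ["tests/test_io.py::test_roundtrip"]

targeted = resolved_targeted(after_patch, f2p, p2p)
full = resolved_full_suite(after_patch, f2p, baseline)
print(targeted, full)
```

Here the patch counts as resolved under the targeted criterion but not under the full-suite one, because it breaks a test outside the listed set. Counting how often the two criteria disagree across a run is one cheap way to see how much a reported solve rate depends on the narrow test selection.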
Journey Context:
OpenAI's SWE-bench Verified audit found that 61.1% of original samples had unit tests that could reject correct patches and 38.3% had underspecified issue descriptions. Later work found that 5.2% of Verified tasks and 7.7% of Lite tasks still have insufficient tests, and that running all tests drops scores by ~4.5% because 28.6% of passing samples are obviously incorrect. The original benchmark also conflates patch correctness with flaky environment setup. Verified fixes the worst cases, but test adequacy remains the Achilles' heel; generated tests and full-suite runs are the practical mitigations.
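The mutation-testing audit can be illustrated with a toy sketch: inject small deliberate bugs (mutants) into a reference implementation and count how many a test suite detects. Everything below (`clamp`, the hand-written mutants, and both suites) is a hypothetical example; a real audit would run a dedicated mutation-testing tool against the repository code that the benchmark tests exercise.

```python
from typing import Callable, List

Impl = Callable[[int, int, int], int]


def clamp(x: int, lo: int, hi: int) -> int:
    """Reference implementation: restrict x to the range [lo, hi]."""
    return max(lo, min(x, hi))


# Hand-written mutants, each containing one small deliberate bug.
MUTANTS: List[Impl] = [
    lambda x, lo, hi: max(lo, max(x, hi)),      # inner min replaced by max
    lambda x, lo, hi: min(lo, min(x, hi)),      # outer max replaced by min
    lambda x, lo, hi: max(lo, min(x, hi - 1)),  # off-by-one upper bound
]


def weak_suite(f: Impl) -> bool:
    """Checks only a value already inside the range."""
    return f(5, 0, 10) == 5


def strong_suite(f: Impl) -> bool:
    """Also checks values below and above the range."""
    return f(5, 0, 10) == 5 and f(-3, 0, 10) == 0 and f(42, 0, 10) == 10


def mutation_score(suite: Callable[[Impl], bool], mutants: List[Impl]) -> float:
    """Fraction of mutants the suite rejects (a mutant is killed when the suite fails)."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)


weak_score = mutation_score(weak_suite, MUTANTS)
strong_score = mutation_score(strong_suite, MUTANTS)
print(f"weak: {weak_score:.2f}, strong: {strong_score:.2f}")
```

Both suites accept the reference implementation, but the weak suite lets the off-by-one mutant survive. A benchmark task whose tests leave many mutants alive is a candidate for accepting incorrect patches, so a low score is a signal to add or generate tests before trusting results on that task.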
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.049968+00:00— report_created — created