Report #553
[research] SWE-bench pass rates overstate real bug-fix quality because PR-derived test suites accept semantically wrong patches
Do not report pass@1 alone. Augment evaluation with generated regression tests that cover the code a patch touches and that fail on plausible incorrect patches. Count a patch as resolved only if it passes both the original tests and a strengthened adversarial suite (coverage-driven augmentation + mutation testing).
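The acceptance rule above can be sketched as follows. This is a minimal illustration, not any benchmark's actual harness: `Suite`, `Verdict`, `judge`, and `kill_rate` are hypothetical names, and each suite is modeled as a plain callable that returns True when every test passes for a given patch. The `kill_rate` helper stands in for the mutation-testing idea: it reports what fraction of known-wrong patches the augmented suite rejects.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical model: a suite takes a patch id and returns True if all tests pass.
Suite = Callable[[str], bool]


@dataclass(frozen=True)
class Verdict:
    passes_original: bool
    passes_augmented: bool

    @property
    def resolved(self) -> bool:
        # Resolved only when BOTH suites accept the patch.
        return self.passes_original and self.passes_augmented


def judge(patch: str, original: Suite, augmented: Suite) -> Verdict:
    """Run a patch against the bundled suite and the strengthened suite."""
    return Verdict(original(patch), augmented(patch))


def kill_rate(augmented: Suite, wrong_patches: Iterable[str]) -> float:
    """Fraction of known-incorrect patches that the augmented suite rejects."""
    wrong = list(wrong_patches)
    if not wrong:
        raise ValueError("need at least one incorrect patch to measure kill rate")
    killed = sum(1 for p in wrong if not augmented(p))
    return killed / len(wrong)


# Toy stand-ins: the bundled tests accept one wrong patch, the augmented suite does not.
original_suite: Suite = lambda p: p in {"gold", "wrong_a"}
augmented_suite: Suite = lambda p: p == "gold"

gold_verdict = judge("gold", original_suite, augmented_suite)
wrong_verdict = judge("wrong_a", original_suite, augmented_suite)
rate = kill_rate(augmented_suite, ["wrong_a", "wrong_b"])
```

In this toy setup `wrong_a` would have counted under pass@1 on the bundled tests alone, but it is not resolved under the stricter rule. A low kill rate would suggest the augmented suite is not yet discriminative enough to trust.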
Journey Context:
SWE-bench instances are built from real PRs, so their tests verify the original patch, not every possible correct fix. Researchers found that 19.7% of patches from top-30 leaderboard agents that passed SWE-bench Verified were semantically incorrect, and tightening the tests dropped the top agent from 78.8% to 62.2%. The common mistake is to assume that fail-to-pass on the bundled tests equals correctness; that assumption ignores coverage gaps and semantic blind spots. Generated tests (e.g., Otter++, SWE-ABS) improve discriminative power, though they add compute cost and can introduce false negatives, i.e. correct patches rejected by the added tests. The right call is to treat the bundled test suite as a necessary but insufficient filter and to explicitly measure test coverage of the changed behavior.
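One rough way to measure test coverage of the changed behavior is to check how many patch-touched lines the tests actually execute. The sketch below assumes you have already extracted two mappings from file path to line numbers: the lines the patch changed (for example from a diff) and the lines executed under the test suite (for example from a coverage tool). `changed_line_coverage` and `uncovered_lines` are hypothetical helper names, and line coverage is only a proxy: an executed line is not necessarily an asserted-on behavior.

```python
from typing import Dict, Set

# Hypothetical input shape: file path -> set of 1-based line numbers.
LineMap = Dict[str, Set[int]]


def changed_line_coverage(changed: LineMap, executed: LineMap) -> float:
    """Fraction of patch-changed lines that the test suite executed."""
    total = sum(len(lines) for lines in changed.values())
    if total == 0:
        raise ValueError("patch has no changed lines to measure")
    hit = sum(
        len(lines & executed.get(path, set()))
        for path, lines in changed.items()
    )
    return hit / total


def uncovered_lines(changed: LineMap, executed: LineMap) -> LineMap:
    """Changed lines no test executed; candidates for generated tests."""
    gaps = {
        path: lines - executed.get(path, set())
        for path, lines in changed.items()
    }
    # Drop files whose changed lines are fully covered.
    return {path: lines for path, lines in gaps.items() if lines}


# Illustrative data only; the path and line numbers are made up.
changed = {"pkg/parser.py": {10, 11, 12, 40}}
executed = {"pkg/parser.py": {1, 2, 10, 11}}

ratio = changed_line_coverage(changed, executed)
gaps = uncovered_lines(changed, executed)
```

Here two of the four changed lines are exercised, so `ratio` is 0.5 and `gaps` points at lines 12 and 40 as places where an incorrect patch could pass unnoticed.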
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:24.154378+00:00— report_created — created