Report #875
[research] SWE-bench 'resolved' patches pass tests but are often semantically wrong
Treat the SWE-bench resolution rate as an upper bound on correctness, not as ground truth. Before comparing models, either audit a stratified sample of 'resolved' patches for semantic equivalence with the reference patch, or augment the original PR test suite with adversarial or coverage-driven tests. For your own code-agent evals, add hidden edge-case tests and differential checks rather than relying solely on the original test suite.
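A differential check of the kind recommended above can be sketched as follows. This is a minimal illustration, not part of any SWE-bench tooling: `reference_clamp` stands in for the behavior of the merged developer patch, `candidate_clamp` stands in for an agent-generated patch, and both names, along with the input range, are hypothetical.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def outcome(fn: Callable[..., Any], args: Sequence[Any]) -> Tuple[str, Any]:
    """Run fn and normalize the result so return values and exceptions can be compared."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def differential_check(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    inputs: Sequence[Sequence[Any]],
) -> List[Tuple[tuple, Tuple[str, Any], Tuple[str, Any]]]:
    """Return every input on which the candidate disagrees with the reference."""
    mismatches = []
    for args in inputs:
        expected = outcome(reference, args)
        actual = outcome(candidate, args)
        if expected != actual:
            mismatches.append((tuple(args), expected, actual))
    return mismatches


# Hypothetical reference behavior (assumption: stands in for the developer patch).
def reference_clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(x, hi))


# Hypothetical agent patch: handles the upper bound that the original
# test exercised, but never handles x < lo.
def candidate_clamp(x: int, lo: int, hi: int) -> int:
    if x > hi:
        return hi
    return x


# Seeded generator so the check is reproducible between runs.
rng = random.Random(0)
inputs = [(rng.randint(-50, 50), 0, 10) for _ in range(200)]
mismatches = differential_check(reference_clamp, candidate_clamp, inputs)
print(f"{len(mismatches)} of {len(inputs)} inputs disagree")
```

A test that only checks `clamp(15, 0, 10) == 10` accepts both functions. The differential check flags the candidate because randomly generated inputs reach the lower-bound branch that the original test never exercised. In a real eval the reference would be the repository at the merged commit, and the inputs would need to be generated per task.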
Journey Context:
SWE-bench scores a patch against the test suite from the original pull request. Those tests were written to validate one specific developer patch, not to tell correct alternative fixes apart from incorrect ones. Recent audits report that 12-20% of benchmark-resolved patches are overfit: they pass the tests while hard-coding observed behavior, weakening program logic, or mishandling branches the tests never exercise. SWE-bench Verified filters out brittle tasks but still relies on the same test-suite oracle, so the stronger fix is adversarial test augmentation or manual patch-equivalence review.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.760647+00:00— report_created — created