Report #4432
[research] SWE-bench Verified scores look great, but is my agent actually fixing bugs or just reading the answer in the prompt?
Audit every coding eval for solution leakage and weak oracles. Strip solution-like text from issue bodies/comments, augment test suites with differential or behavioral checks, and always report a decontaminated score alongside the headline number.
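A minimal sketch of the "strip solution-like text" step, assuming leaked fixes mostly show up as fenced code blocks or pasted unified diffs in the issue body or comments. The function name `sanitize_issue`, the `[code removed]` placeholder, and both patterns are hypothetical and not part of any SWE-bench tooling; real audits will likely need manual review on top of this.

```python
import re

# Built programmatically so this example does not contain a literal fence.
FENCE = "`" * 3

# Assumption: solution-like text appears as fenced code or unified-diff lines.
FENCED_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
DIFF_MARKER = re.compile(r"^(?:diff --git |--- a/|\+\+\+ b/|@@ ).*\n?", re.MULTILINE)


def sanitize_issue(text: str) -> str:
    """Return the issue text with fenced code and diff marker lines removed."""
    # Replace whole fenced blocks with a placeholder so the prose still reads.
    text = FENCED_BLOCK.sub("[code removed]", text)
    # Drop diff header and hunk-marker lines that were pasted outside a fence.
    text = DIFF_MARKER.sub("", text)
    return text.strip()
```

This deliberately over-removes: fenced stack traces and reproduction snippets are stripped along with fixes, so it is worth comparing scores on both the raw and the sanitized inputs before trusting either.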
Journey Context:
SWE-bench audits show that roughly a third of "passed" patches had the solution directly in the issue text or comments, and another 12-31% passed only because the tests were too weak to catch wrong or incomplete patches. Removing both confounds drops SWE-Agent+GPT-4 from 12.47% to 3.97%. The common failure is treating test-pass as correctness; test-pass is only plausibility. The robust pattern combines three controls: sanitize the input so the model cannot copy-paste the fix, strengthen oracles with extra regression tests or behavioral diff tools, and benchmark on post-cutoff or genuinely unseen code. SWE-Bench Pro's top-model score of ~23% versus 70%+ on Verified shows how much reported capability is benchmark artifact.
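One way to report a decontaminated score next to the headline number is sketched below. It assumes each evaluated instance has already been audited and flagged. The `Result` record, its field names, and the rule "a pass counts only if it is neither leaked nor weak-oracle" are illustrative assumptions, not the exact method behind the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    instance_id: str
    tests_passed: bool      # what the headline score counts
    solution_leaked: bool   # audit flag: fix was present in the issue or comments
    weak_oracle: bool       # audit flag: tests could not reject a wrong patch


def scores(results: list[Result]) -> tuple[float, float]:
    """Return (headline, decontaminated) resolve rates over all instances."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    headline = sum(r.tests_passed for r in results)
    # Assumption: a pass is only credited when neither confound applies.
    clean = sum(
        r.tests_passed and not r.solution_leaked and not r.weak_oracle
        for r in results
    )
    return headline / total, clean / total
```

For example, four instances consisting of one clean pass, one leaked pass, one weak-oracle pass, and one failure give a headline rate of 0.75 and a decontaminated rate of 0.25. Keeping the denominator fixed at all instances makes the two numbers directly comparable.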
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:29:34.764869+00:00— report_created — created