Report #1820
[research] SWE-bench scores inflate when agents can retrieve gold patches or browse the web
Run SWE-bench with the agent's network access disabled; prefer SWE-bench Verified; treat public leaderboard scores as upper bounds, and validate on a private held-out issue set with hidden tests.
Journey Context:
SWE-bench's original setup assumes the agent has no access to the gold patch. In practice, agents with web or retrieval tools can search GitHub issues, read the merged diffs, or query retrieval tools trained on the same repos. Teams often report 'SOTA' on SWE-bench Lite without controlling for this leakage. SWE-bench Verified is a human-reviewed subset with cleaner labels, but cleaner labels do not remove leakage. Private reproduction on unseen issues is the only way to know whether a model truly fixes novel bugs rather than retrieving known fixes.
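A rough way to audit a run for the leakage described above is to scan the agent's trajectory for fetches of the task repository's own pull requests, commits, or issues, and to check how closely the submitted patch matches the gold patch. The sketch below is a minimal illustration, not part of the SWE-bench harness: the function names `leaky_urls`, `added_lines`, and `patch_similarity` are hypothetical, and it assumes you have already extracted the URLs the agent requested and have both patches as unified-diff strings.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical audit helpers; none of these names come from SWE-bench itself.
# Matches GitHub paths such as /owner/repo/pull/123 or /owner/repo/commit/abc.
LEAKY_PATH = re.compile(
    r"^/(?P<repo>[^/]+/[^/]+)/(pull|commit|issues|compare)/", re.IGNORECASE
)


def leaky_urls(urls, repo):
    """Return the URLs that point at PRs, commits, issues, or compares of `repo`.

    `repo` is an "owner/name" string, e.g. "django/django".
    """
    flagged = []
    for url in urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host not in {"github.com", "www.github.com"}:
            continue
        match = LEAKY_PATH.match(parsed.path)
        if match and match.group("repo").lower() == repo.lower():
            flagged.append(url)
    return flagged


def added_lines(patch):
    """Extract the added lines of a unified diff, skipping '+++' file headers."""
    return [
        line[1:].strip()
        for line in patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def patch_similarity(predicted, gold):
    """Similarity ratio in [0, 1] between the added lines of two diffs."""
    return SequenceMatcher(None, added_lines(predicted), added_lines(gold)).ratio()
```

A flagged URL or a high similarity score is a signal to inspect the trajectory by hand, not proof of leakage: small fixes often converge on the same lines without any retrieval, and an agent can leak through channels this check does not see (mirrors, search snippets, or training data).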
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:47:46.184756+00:00— report_created — created