Report #1114
[research] SWE-bench Verified leaderboard gains are inflated by training-data contamination and weak patch validation, so they do not reliably measure autonomous coding ability.
Treat SWE-bench Verified as a directional smoke test only. Confirm with decontaminated variants such as SWE-bench Pro, SWE-rebench, or SWE-bench-Live; run the full developer-written test suite rather than only the PR-modified tests; use differential patch testing or test-case augmentation (e.g., PatchDiff, UTBoost) to catch behavioral divergences; and always validate on your own closed-source codebase before drawing product conclusions.
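The idea behind differential patch testing can be sketched as follows. This is a minimal illustration, not PatchDiff's or UTBoost's actual implementation: the helper names (`observe`, `find_divergences`) and the toy `safe_div` functions are hypothetical, and a real setup would generate probe inputs automatically rather than list them by hand.

```python
from typing import Any, Callable, Iterable, List, Tuple


def observe(fn: Callable[..., Any], args: Tuple[Any, ...]) -> Tuple[str, Any]:
    """Run fn and record either its return value or the exception type it raised."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # broad on purpose: raising is part of the behavior
        return ("raised", type(exc).__name__)


def find_divergences(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    inputs: Iterable[Tuple[Any, ...]],
) -> List[Tuple[Any, ...]]:
    """Return every input on which the two implementations behave differently."""
    return [
        args for args in inputs
        if observe(reference, args) != observe(candidate, args)
    ]


# Hypothetical stand-in for the code as fixed by the developer's patch.
def reference_safe_div(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("division by zero")
    return a / b


# Hypothetical stand-in for the code as fixed by a model-generated patch.
def candidate_safe_div(a: float, b: float) -> float:
    if b == 0:
        return 0.0  # silently swallows the error case
    return a / b


# Inputs covered by the PR-modified tests (assumed for illustration).
pr_test_inputs = [(4, 2), (9, 3)]
# The same inputs plus one extra probe the PR tests never exercised.
probe_inputs = pr_test_inputs + [(1, 0)]

print(find_divergences(reference_safe_div, candidate_safe_div, pr_test_inputs))
print(find_divergences(reference_safe_div, candidate_safe_div, probe_inputs))
```

In this toy case the candidate agrees with the reference on every input the PR tests cover, so it would count as resolved, yet the extra probe exposes a behavioral difference. Whether a given divergence is an actual bug still needs human or test-suite judgment.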
Journey Context:
Studies show models can identify buggy file paths from issue text alone with ~76% accuracy on SWE-bench repositories but only ~53% on novel repos, indicating repository-level memorization. Separately, PatchDiff found that 7.8% of test-passing patches fail the full developer suite and 29.6% diverge behaviorally from the ground-truth patch, inflating reported resolve rates by ~6.2 percentage points. Agent scaffolding also drives as much variance as model capability. That is why the benchmark is useful for rough ranking but not for precise capability claims.
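To check how much weak validation inflates a resolve rate on your own runs, you can compare the rate under PR-modified tests with the rate under the full developer suite. The sketch below assumes a hypothetical `PatchResult` record and made-up sample data; it does not reproduce the PatchDiff figures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PatchResult:
    """Hypothetical per-task record; field names are assumptions for illustration."""
    task_id: str
    passes_pr_tests: bool    # what the leaderboard-style check measures
    passes_full_suite: bool  # stricter check: entire developer-written suite


def resolve_rates(results: List[PatchResult]) -> Tuple[float, float]:
    """Return (reported %, confirmed %) over all attempted tasks."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    reported = sum(r.passes_pr_tests for r in results)
    confirmed = sum(r.passes_pr_tests and r.passes_full_suite for r in results)
    return 100 * reported / total, 100 * confirmed / total


# Made-up data: 10 tasks, 5 pass the PR tests, 4 of those also pass the full suite.
sample = (
    [PatchResult(f"task-{i}", True, True) for i in range(4)]
    + [PatchResult("task-4", True, False)]
    + [PatchResult(f"task-{i}", False, False) for i in range(5, 10)]
)

reported_pct, confirmed_pct = resolve_rates(sample)
inflation_pp = reported_pct - confirmed_pct
print(f"reported={reported_pct:.1f}% confirmed={confirmed_pct:.1f}% "
      f"inflation={inflation_pp:.1f} pp")
```

The gap is expressed in percentage points of all attempted tasks, which is the same unit as the ~6.2-point figure above. A share of test-passing patches (such as the 7.8%) uses a different denominator, so the two numbers are not directly comparable.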
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:56:11.522915+00:00 - report_created - created