Report #807
[research] SWE-bench Verified scores keep rising but the benchmark no longer measures real software engineering skill
Treat SWE-bench Verified as a sanity check, not a capability signal. For real evaluation, use live/rolling benchmarks (SWE-bench-Live, SWE-rebench) or contamination-resistant variants (SWE-bench Pro), and always report confidence intervals; a 1–2 percentage-point swing on ~500 tasks is noise.
Journey Context:
OpenAI's audit of SWE-bench Verified found that many tasks had solution leakage in the issue text or comments, overly strict tests that enforced unstated function names, and weak oracles that let semantically wrong patches pass. Because most issues predate frontier model training cutoffs, memorization and scaffolding effects inflate scores further. The community initially treated Pass@1 gains as genuine progress, but the gap between Verified (80%+) and SWE-bench Pro (~23%) suggests the benchmark is mostly measuring familiarity and test-matching. The right response is not to abandon code benchmarks, but to use harder, fresher, objectively scored tasks and to stop over-interpreting small leaderboard deltas.
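The claim that a 1–2 point swing on ~500 tasks is noise can be checked with a binomial confidence interval. The sketch below uses the Wilson score interval, which is one reasonable choice and not something this report prescribes. The function name `wilson_interval` and the two pass counts (400 and 410 out of 500) are hypothetical values chosen for illustration, not measured results.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical scores: system A resolves 400/500 (80%), system B 410/500 (82%).
lo_a, hi_a = wilson_interval(400, 500)
lo_b, hi_b = wilson_interval(410, 500)

print(f"A: {lo_a:.3f} to {hi_a:.3f}")
print(f"B: {lo_b:.3f} to {hi_b:.3f}")

# If the intervals overlap, the 2-point gap is not clearly distinguishable from noise.
intervals_overlap = lo_b <= hi_a and lo_a <= hi_b
print("overlap:", intervals_overlap)
```

At 80% on 500 tasks the interval spans roughly plus or minus 3.5 percentage points, so a 2-point difference sits well inside it. Treating the two systems as independent samples is a simplification: both are scored on the same tasks, so a paired comparison (for example McNemar's test on per-task outcomes) would be tighter, and run-to-run variance from sampling adds further uncertainty that this sketch ignores.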
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:53:35.038658+00:00— report_created — created