Report #3557
[research] SWE-bench leaderboard numbers can mislead because many tasks reject valid patches or are underspecified
Report % Resolved on SWE-bench Verified (500 human-vetted samples), run with the Dockerized harness, and treat Full/Lite scores as lower bounds; inspect trajectories for unfair FAIL_TO_PASS tests.
Journey Context:
The original SWE-bench hides its tests from agents, but those tests are often extracted from PRs and can be overly specific (for example, demanding an exact deprecation message) or unrelated to the issue. OpenAI's annotation found that ~38% of samples had underspecified problem statements and ~61% had unit tests that could unfairly reject valid solutions. SWE-bench Verified was created with the SWE-bench authors to remove infeasible or ambiguous samples, and a Docker harness was added for reproducible evaluation. The common mistake is citing raw SWE-bench Full scores without checking whether the underlying samples were flagged; Verified is the preferred reporting set.
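To make the metric concrete, here is a minimal sketch of how % Resolved could be computed and how unmet FAIL_TO_PASS tests could be listed for manual review. It assumes an instance counts as resolved only when every FAIL_TO_PASS and PASS_TO_PASS test passes. The function names, the instance IDs, and the shape of the input dictionaries are hypothetical and chosen for illustration; this is not the official harness, which runs the tests inside Docker and produces its own report.

```python
from typing import Dict, Iterable, List, Set


def is_resolved(fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str],
                passed: Set[str]) -> bool:
    """True only if every FAIL_TO_PASS and every PASS_TO_PASS test passed."""
    return (all(t in passed for t in fail_to_pass)
            and all(t in passed for t in pass_to_pass))


def unmet_fail_to_pass(fail_to_pass: Iterable[str],
                       passed: Set[str]) -> List[str]:
    """FAIL_TO_PASS tests that did not pass; candidates for a manual
    fairness check (overly specific or unrelated to the issue)."""
    return [t for t in fail_to_pass if t not in passed]


def percent_resolved(specs: Dict[str, dict],
                     passed_by_instance: Dict[str, Set[str]]) -> float:
    """Percentage of instances in `specs` that are resolved.

    `specs` maps instance ID -> {"FAIL_TO_PASS": [...], "PASS_TO_PASS": [...]}.
    `passed_by_instance` maps instance ID -> set of test names that passed.
    Instances with no recorded results count as unresolved.
    """
    if not specs:
        return 0.0
    resolved = sum(
        is_resolved(spec["FAIL_TO_PASS"], spec["PASS_TO_PASS"],
                    passed_by_instance.get(instance_id, set()))
        for instance_id, spec in specs.items()
    )
    return 100.0 * resolved / len(specs)


if __name__ == "__main__":
    # Hypothetical two-instance example, not real benchmark data.
    specs = {
        "demo__repo-1": {"FAIL_TO_PASS": ["test_fix"],
                         "PASS_TO_PASS": ["test_old"]},
        "demo__repo-2": {"FAIL_TO_PASS": ["test_exact_deprecation_message"],
                         "PASS_TO_PASS": ["test_old"]},
    }
    passed = {
        "demo__repo-1": {"test_fix", "test_old"},
        "demo__repo-2": {"test_old"},
    }
    print(f"{percent_resolved(specs, passed):.1f}% resolved")
    for instance_id, spec in specs.items():
        unmet = unmet_fail_to_pass(spec["FAIL_TO_PASS"],
                                   passed.get(instance_id, set()))
        if unmet:
            print(instance_id, "unmet FAIL_TO_PASS:", unmet)
```

Restricting `specs` to the Verified instance IDs before calling `percent_resolved` yields the preferred number; the unmet-test listing is only a starting point for reading the trajectory, since a failing FAIL_TO_PASS test does not by itself show the test is unfair.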
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:33:17.447566+00:00— report_created — created