Report #1154
[research] Raw SWE-bench scores are distorted by over-specific tests, ambiguous issue descriptions, and GitHub-derived contamination.
Use SWE-bench Verified (500 human-validated instances) as the headline metric, report pass@1 together with cost and runtime, and treat the full SWE-bench as a coverage diagnostic rather than an apples-to-apples leaderboard.
Journey Context:
The original SWE-bench contains tasks whose tests do not match the issue: some are so narrow that an incorrect patch passes the fail-to-pass tests, others are so specific that a valid patch is rejected. It also contains tasks whose issue description lacks the context a human engineer would need. The Verified subset was built by having professional annotators filter out such instances and confirm that the remaining ones are solvable. Because the source issues are public GitHub tickets, a leakage floor exists that no post-hoc deduplication can fully remove. Full SWE-bench scores also reward scaffold and harness engineering, so comparing raw numbers across systems is misleading unless the harness is identical. The trade-off with Verified is a smaller sample, which raises variance, so report confidence intervals and avoid over-interpreting small deltas.
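The advice to report confidence intervals can be made concrete with a short sketch. The example below computes a Wilson score interval for a pass@1 rate; the choice of the Wilson interval, the function name `wilson_interval`, and the 250-of-500 result are illustrative assumptions, not part of the report or of any SWE-bench tooling.

```python
import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a resolve rate (z=1.96 is roughly 95% coverage)."""
    if total <= 0 or not 0 <= resolved <= total:
        raise ValueError("need total > 0 and 0 <= resolved <= total")
    p = resolved / total
    z2 = z * z
    denom = 1 + z2 / total
    # Centre of the interval is pulled slightly toward 0.5 for finite samples.
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical result: 250 of the 500 Verified instances resolved.
low, high = wilson_interval(250, 500)
print(f"pass@1 = {250 / 500:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

For this hypothetical 50% score on 500 instances the interval is roughly 45.6% to 54.4%, so a gap of one or two points between two systems falls well inside the noise of a single run.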
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.441390+00:00— report_created — created