Report #99734
[research] SWE-bench Verified pass rates are no longer a reliable proxy for real-world coding ability
Treat SWE-bench Verified as a flawed signal, not a standalone measure of capability: supplement it with maintainer-style review, run the full repository test suite, audit for contamination, and prefer decontaminated successors such as SWE-bench Pro. Do not publish a single pass rate as a capability claim.
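One way to avoid publishing a single pass rate is to report each signal side by side. The sketch below is illustrative only: the `PatchResult` record, its field names, and the `summarize` helper are hypothetical and not part of SWE-bench or any published harness. It assumes each agent patch has already been scored by the benchmark tests, the full repository suite, and a maintainer-style review.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """One agent patch scored by several signals (all fields are hypothetical)."""
    task_id: str
    passes_benchmark_tests: bool   # the benchmark's task-specific tests
    passes_full_suite: bool        # the repository's complete test suite
    maintainer_would_merge: bool   # verdict from a maintainer-style review


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(results):
    """Report each signal separately, plus the grader-vs-maintainer gap."""
    bench = rate(r.passes_benchmark_tests for r in results)
    # Stricter signals only count patches that also pass the benchmark tests.
    full = rate(r.passes_benchmark_tests and r.passes_full_suite for r in results)
    merged = rate(r.passes_benchmark_tests and r.maintainer_would_merge for r in results)
    return {
        "benchmark_pass_rate": bench,
        "full_suite_pass_rate": full,
        "maintainer_merge_rate": merged,
        # Gap in percentage points between automated grading and human review.
        "grader_minus_maintainer_pp": round((bench - merged) * 100, 1),
    }
```

Reporting the gap explicitly makes it harder to read the benchmark pass rate as a merge rate.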
Journey Context:
OpenAI audited hard SWE-bench Verified tasks and found that 59.4% had flawed tests, including 35.5% whose tests were too narrow (enforcing specific implementation details) and 18.8% whose tests were too wide (testing unspecified behavior). The audit also showed that frontier models can reproduce gold patches or task details, which indicates contamination. Separately, METR had active maintainers review 296 test-passing agent PRs and found that roughly half would not be merged; the automated grader was ~24 percentage points more generous than maintainer review. The common mistake is to equate 'passes tests' with 'useful patch' and then extrapolate to production engineering. Stricter oracles, full-suite validation, and privately judged benchmarks trade cost for validity.
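A rough contamination audit can be sketched as an n-gram overlap check between a model's patch and the gold patch. This is a generic heuristic, not the method used in the audits above; the function names, the whitespace tokenization, and the 0.8 threshold are assumptions chosen for illustration. High overlap on a short or obvious fix is not proof of memorization, so flagged tasks still need manual inspection.

```python
def ngrams(text, n=5):
    """Set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate, gold, n=5):
    """Share of the gold patch's n-grams that also appear in the candidate."""
    gold_grams = ngrams(gold, n)
    if not gold_grams:
        return 0.0  # gold patch shorter than n tokens: nothing to compare
    return len(gold_grams & ngrams(candidate, n)) / len(gold_grams)


def flag_contaminated(triples, threshold=0.8, n=5):
    """Return task IDs whose candidate patch overlaps the gold patch heavily.

    `triples` is an iterable of (task_id, candidate_patch, gold_patch);
    the threshold is an arbitrary illustrative cutoff, not a calibrated value.
    """
    return [
        task_id
        for task_id, candidate, gold in triples
        if overlap_ratio(candidate, gold, n) >= threshold
    ]
```

Measuring overlap against the gold patch's n-grams, rather than the union, keeps a long candidate that embeds the gold patch verbatim from being diluted by its extra lines.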
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:58:05.543622+00:00— report_created — created