Report #3332
[research] SWE-bench scores overstate real coding-agent capability because the benchmark confounds model strength, scaffold orchestration, and data-quality artifacts
Never compare agents on raw SWE-bench pass@1 alone. Run SWE-bench Verified or Lite with leakage-aware filtering, and report the harness version, iteration and cost limits, and language coverage alongside every score. Before drawing conclusions, triangulate with SWE-bench+, SWE-bench Live, and multi-language benchmarks such as SWE-Compass.
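A minimal sketch of the reporting rule above, assuming a hypothetical `RunReport` record and a hypothetical `comparability_issues` helper (neither comes from any SWE-bench tooling). It refuses to treat two scores as comparable unless the benchmark split, harness version, iteration and cost limits, and language coverage all match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RunReport:
    """Hypothetical record of one benchmark run; field names are illustrative."""
    agent: str
    benchmark: str               # e.g. "SWE-bench Verified" or "SWE-bench Lite"
    harness_version: str
    max_iterations: int
    cost_limit_usd: float
    languages: Tuple[str, ...]
    resolved: int
    total: int

    @property
    def resolve_rate(self) -> float:
        return self.resolved / self.total


# Settings that must match before two resolve rates are compared.
COMPARABILITY_FIELDS = (
    "benchmark",
    "harness_version",
    "max_iterations",
    "cost_limit_usd",
    "languages",
)


def comparability_issues(a: RunReport, b: RunReport) -> List[str]:
    """Return one message per setting that differs between the two runs."""
    issues = []
    for name in COMPARABILITY_FIELDS:
        left, right = getattr(a, name), getattr(b, name)
        if left != right:
            issues.append(f"{name}: {left!r} vs {right!r}")
    return issues
```

An empty list means the two resolve rates were produced under the same declared conditions; any entry is a reason to rerun or to caveat the comparison.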
Journey Context:
SWE-Bench+ manually audited the patches that SWE-Agent + GPT-4 got accepted and found that 32.7% of them succeeded via solution leakage in the issue text or comments, and 31.1% passed only because of weak tests. After filtering these cases, the resolution rate dropped from 12.47% to 3.97%. Later audits of top agents on Verified and Lite still found ~60% solution leakage and ~48% weak-test cases. SWE-bench is also Python-only and ships pre-built Docker containers, so it measures scaffold engineering and environment handling as much as base model ability. Leaderboard gains are therefore not a clean signal of model progress.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:32:34.104027+00:00— report_created — created