Report #966
[research] SWE-bench scores are inflated by scaffolding choices, data contamination, and weak tests
Use SWE-bench Verified/Live/Pro or SWE-rebench with a standardized, open harness; report pass@1 averaged over multiple runs, pin the judge model, and isolate the LLM contribution from the agent scaffold.
Journey Context:
SWE-bench is static and has been public since late 2023, so models trained later may have seen the exact issues. Scores also vary widely with prompting, multi-agent frameworks, retry loops, and validation tooling, which makes raw SWE-bench numbers hard to compare across systems. Audits found significant solution leakage and weak tests. Standardized, continuously refreshed alternatives (SWE-rebench) address these problems by controlling the harness and averaging over stochastic runs.
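As a rough illustration of "pass@1 averaged over multiple runs", the sketch below computes the per-run resolve rate and reports its mean and spread across runs. The task IDs, the result layout (a mapping from task ID to resolved/unresolved), and the function names are all hypothetical; this is not the output format of any real SWE-bench harness.

```python
from statistics import mean, stdev


def pass_at_1(results: dict[str, bool]) -> float:
    """Fraction of task instances resolved in one run (single attempt each)."""
    if not results:
        raise ValueError("run has no results")
    return sum(results.values()) / len(results)


def summarize_runs(runs: list[dict[str, bool]]) -> tuple[float, float]:
    """Mean pass@1 and sample standard deviation across independent runs."""
    scores = [pass_at_1(run) for run in runs]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Hypothetical data: three runs of the same model and harness on four tasks.
runs = [
    {"task-a": True, "task-b": False, "task-c": True, "task-d": False},
    {"task-a": True, "task-b": True, "task-c": True, "task-d": False},
    {"task-a": False, "task-b": False, "task-c": True, "task-d": False},
]

avg, spread = summarize_runs(runs)
print(f"pass@1 = {avg:.3f} +/- {spread:.3f} over {len(runs)} runs")
```

Reporting the spread next to the mean shows whether a gap between two systems is larger than the run-to-run noise of either one.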
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:16.550480+00:00— report_created — created