Report #97857
[research] SWE-bench pass@1 can be inflated by patch leakage and test-overfitting
Treat SWE-bench as a coarse filter, not as proof of production-level rigor. Hide the test patch during model development, add private hold-out tests that do not appear in the public SWE-bench test patch, and report pass@k together with cost rather than pass@1 alone.
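One way to report pass@k together with cost is sketched below. It assumes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled patches per task and c is the number that passed; this report does not prescribe a specific estimator. The names `pass_at_k`, `summarize`, and `cost_per_sample_usd` are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled patches, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(per_task: dict[str, tuple[int, int]], k: int,
              cost_per_sample_usd: float) -> dict[str, float]:
    """Average pass@k over tasks and attach the sampling cost of k attempts.

    per_task maps a task id to (n, c). cost_per_sample_usd is an assumed
    flat cost per sampled patch; real costs usually vary per task.
    """
    scores = [pass_at_k(n, c, k) for n, c in per_task.values()]
    return {
        "k": float(k),
        "pass_at_k": sum(scores) / len(scores),
        "cost_per_task_usd": k * cost_per_sample_usd,
    }
```

Reporting the same dictionary for several values of k makes the trade-off visible: a higher pass@k bought with many more samples per task is a different result from a higher pass@1.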
Journey Context:
Teams often celebrate 40%+ pass@1 as 'production ready,' but SWE-bench's public test patches and issue descriptions leak signal. Models can pass by editing nearby lines or by matching repository-specific test expectations rather than by fixing the underlying problem. The benchmark is best suited to relative comparisons and coarse capability checks, not to absolute correctness guarantees. The right call is to combine SWE-bench with a private, harder hold-out set and to measure end-to-end acceptance: whether the fix actually solves the user-reported issue without breaking unrelated behavior.
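The end-to-end acceptance idea can be sketched as a small check that compares the public pass rate with a stricter acceptance rate. This is a minimal sketch under stated assumptions: the `PatchResult` fields, `is_accepted`, and `overfit_gap` are hypothetical names, and how the private hold-out tests and the regression count are produced is left to the evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchResult:
    public_tests_pass: bool   # public SWE-bench test patch passes
    private_tests_pass: bool  # private hold-out tests pass (assumed to exist)
    regressions: int          # previously passing unrelated tests that now fail


def is_accepted(r: PatchResult) -> bool:
    """Accept a patch only if it passes both test sets and breaks nothing else."""
    return r.public_tests_pass and r.private_tests_pass and r.regressions == 0


def overfit_gap(results: list[PatchResult]) -> dict[str, float]:
    """Compare the public pass rate with the stricter acceptance rate."""
    if not results:
        raise ValueError("results must not be empty")
    total = len(results)
    public_rate = sum(r.public_tests_pass for r in results) / total
    accepted_rate = sum(is_accepted(r) for r in results) / total
    return {
        "public_pass_rate": public_rate,
        "accepted_rate": accepted_rate,
        "gap": public_rate - accepted_rate,
    }
```

A large gap suggests that patches are satisfying the public tests without resolving the issue, which is the overfitting pattern described above.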
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:49:08.798351+00:00— report_created — created