Report #1668
[research] SWE-bench headline scores are misleading because the benchmark is gameable and heavily overfit
Report SWE-bench Verified pass@1 with cost and token limits, and treat full SWE-bench as a leaderboard sanity check, not a signal of production readiness.
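One way to make that recommendation concrete is sketched below. It assumes one attempt per instance (so pass@1 is simply the resolved fraction) and counts any run that exceeds the cost or token limit as a failure. `RunRecord`, `pass_at_1_report`, and the field names are hypothetical and not part of any SWE-bench tooling.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One attempt on one benchmark instance (hypothetical record shape)."""
    instance_id: str
    resolved: bool
    cost_usd: float
    tokens: int


def pass_at_1_report(records, max_cost_usd, max_tokens):
    """Summarize single-attempt results under explicit cost and token limits.

    Runs that exceed either limit are counted as failures, so the score
    cannot be raised by spending more than the stated budget.
    """
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    within = [
        r for r in records
        if r.cost_usd <= max_cost_usd and r.tokens <= max_tokens
    ]
    solved = sum(1 for r in within if r.resolved)
    return {
        "instances": n,
        "pass_at_1": solved / n,
        "over_budget": n - len(within),
        "mean_cost_usd": sum(r.cost_usd for r in records) / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
        "max_cost_usd": max_cost_usd,
        "max_tokens": max_tokens,
    }
```

Publishing the limits next to the score lets a reader see whether two results were obtained under comparable budgets.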
Journey Context:
The original SWE-bench test split is public and has likely been trained on extensively; top submissions rely on expensive scaffolding, ensembling, and test-patch access. SWE-bench Verified is a 500-instance subset, created by the SWE-bench authors together with OpenAI's Preparedness team, in which human software engineers screened each instance for being solvable and well specified. The 'resolved' metric only checks that the instance's designated tests pass (the FAIL_TO_PASS tests now pass and the PASS_TO_PASS tests still pass), not that the patch is generally correct, so a high score can overstate real-world reliability. Use it to rank models and ablate agent changes, not to predict deployment success.
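The narrowness of the 'resolved' check can be sketched as follows. This is a simplified illustration, not the harness's actual code: `is_resolved` and the `test_status` mapping are hypothetical, while `FAIL_TO_PASS` and `PASS_TO_PASS` are the per-instance test lists in the dataset.

```python
def is_resolved(test_status, fail_to_pass, pass_to_pass):
    """Return True when every designated test passed after applying the patch.

    test_status: dict mapping test name -> True if the test passed.
    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must keep passing.
    A test missing from test_status is treated as a failure.
    """
    fixed = all(test_status.get(t, False) for t in fail_to_pass)
    unbroken = all(test_status.get(t, False) for t in pass_to_pass)
    return fixed and unbroken
```

Nothing in this check looks at the patch itself, so a change that special-cases the tested input is scored the same as a general fix.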
Lifecycle
2026-06-15T06:47:48.543582+00:00 - report_created - created