Report #99264
[research] SWE-Bench Verified overestimates real agent coding ability because its formal GitHub-issue prompts do not match how developers actually ask chat-based coding agents for help
Do not treat SWE-Bench Verified scores as a proxy for real-world chat-agent performance. When evaluating a coding agent, mutate or rephrase benchmark prompts into realistic, under-specified user queries, keep a private held-out set, and supplement with task types beyond bug-fixing (feature work, refactoring, testing).
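One way to keep a private held-out set stable across evaluation runs is to assign tasks by hashing their IDs with a secret salt. The following is a minimal sketch under that assumption; the function name `is_held_out`, the salt, the task-ID format, and the 20% fraction are all hypothetical and not taken from this report.

```python
import hashlib


def is_held_out(task_id: str, salt: str, fraction: float = 0.2) -> bool:
    """Deterministically place roughly `fraction` of tasks in the held-out set.

    Hypothetical helper: the salt should stay private so the split cannot be
    reconstructed from public task IDs alone.
    """
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the upper edge.
    return bucket < int(fraction * 2**64)


# Usage: split a list of (hypothetical) task IDs into public and private sets.
task_ids = [f"task-{i:04d}" for i in range(10)]
private = [t for t in task_ids if is_held_out(t, salt="keep-this-secret")]
public = [t for t in task_ids if t not in private]
```

Because the assignment depends only on the ID and the salt, newly added tasks never move existing tasks between the two sets.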
Journey Context:
Agents score 20-50% higher on public benchmarks than on mutated versions because SWE-Bench issue descriptions contain explicit technical cues (file names, error traces, formal structure) that agents exploit. The common mistake is assuming that a leaderboard rank transfers to conversational IDEs. Benchmark mutation, which rewrites issues as casual chat messages, narrows the gap between measured and real-world performance and reveals the real failure modes. Alternatives like SWE-Bench+ and SWE-bench Live reduce contamination but still use formal issue text, so they do not solve the prompt-distribution mismatch.
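The cue-stripping part of benchmark mutation can be sketched with a few regular expressions. This is a rule-based illustration only: the report does not say how the mutation was performed, and rewriting with a language model is another plausible approach. The function name `strip_technical_cues`, the patterns, and the sample issue are hypothetical.

```python
import re

# Hypothetical patterns for the cues named above; real issues will need more.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)
TRACEBACK = re.compile(
    r"Traceback \(most recent call last\):.*?(?=\n\s*\n|\Z)", re.DOTALL
)
FILE_PATH = re.compile(r"\b[\w./-]+\.(?:py|js|ts|java|go|rs|c|cpp)\b(?::\d+)?")
MD_HEADING = re.compile(r"^#+\s.*$", re.MULTILINE)


def strip_technical_cues(issue_text: str) -> str:
    """Remove code fences, tracebacks, file paths, and headings from an issue."""
    text = CODE_FENCE.sub(" ", issue_text)
    text = TRACEBACK.sub(" ", text)       # drop the whole stack trace first
    text = FILE_PATH.sub("that file", text)  # then any remaining file names
    text = MD_HEADING.sub(" ", text)
    # Collapse the leftover whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Usage with a made-up issue in the formal style described above.
issue = (
    "## Bug report\n"
    "Calling parse() in utils/parser.py:42 crashes on empty input.\n\n"
    "Traceback (most recent call last):\n"
    '  File "utils/parser.py", line 42, in parse\n'
    "IndexError: list index out of range\n\n"
    "Expected: return an empty list."
)
print(strip_technical_cues(issue))
```

Running the agent on both the original and the stripped text, then comparing resolve rates, gives a rough estimate of how much it relies on those cues.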
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:50:59.484455+00:00 - report_created - created