Report #100025
[cost_intel] High reasoning effort on SWE-bench Verified costs 3.5x more for only about 8 percentage points more tasks solved
Use medium or low reasoning effort for agentic coding, and add a verifier or pass@K selection instead of defaulting to high effort. On SWE-bench Verified, o1 at high effort solved 29.1% of tasks for $1,400, while low effort solved 21.0% for $400. Sampling a few medium-effort solutions and keeping the one with the lowest overthinking score reached 30.3% for $1,200.
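The trade-off can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this report; the dictionary keys and function names are illustrative and do not come from the paper.

```python
# Figures quoted in this report (SWE-bench Verified, o1).
# Key names are illustrative labels, not identifiers from the paper.
configs = {
    "low": {"solved_pct": 21.0, "cost_usd": 400},
    "high": {"solved_pct": 29.1, "cost_usd": 1400},
    "selected_samples": {"solved_pct": 30.3, "cost_usd": 1200},
}


def cost_ratio(base, other):
    """How many times more expensive `other` is than `base`."""
    return other["cost_usd"] / base["cost_usd"]


def gain_pp(base, other):
    """Accuracy gain of `other` over `base`, in percentage points."""
    return other["solved_pct"] - base["solved_pct"]


def marginal_cost_per_pp(base, other):
    """Extra dollars spent per extra percentage point solved."""
    return (other["cost_usd"] - base["cost_usd"]) / gain_pp(base, other)


low = configs["low"]
for name in ("high", "selected_samples"):
    cfg = configs[name]
    print(
        f"{name}: {cost_ratio(low, cfg):.1f}x cost, "
        f"+{gain_pp(low, cfg):.1f}pp, "
        f"${marginal_cost_per_pp(low, cfg):.2f} per extra pp"
    )
```

Relative to low effort, high effort pays roughly $123 per extra percentage point, while the selection approach pays roughly $86 and ends with a higher solve rate.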
Journey Context:
The "Danger of Overthinking" paper shows that simply setting reasoning effort to high is a poor cost-quality trade-off in real-world coding. On SWE-bench Verified, high effort cost 3.5x as much as low effort ($1,400 vs. $400) for an 8.1pp accuracy gain (29.1% vs. 21.0%). More importantly, selecting among a few samples by overthinking score let the authors beat the high-effort baseline at lower cost. The implication: spend the budget on search and selection, not on a single high-effort trace. The signature of misallocated effort is a long internal monologue that repeats what tool outputs have already revealed.
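The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Candidate` type, its fields, and the sample values are hypothetical, and it assumes each sampled trace has already been given an overthinking score by some external judge, with lower meaning less overthinking.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled solution attempt (hypothetical structure)."""
    patch: str
    overthinking_score: float  # assumed: lower means less overthinking
    cost_usd: float


def select_least_overthought(candidates):
    """Return the candidate with the lowest overthinking score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c.overthinking_score)


def total_cost(candidates):
    """Every sample is paid for, not only the one that is kept."""
    return sum(c.cost_usd for c in candidates)


# Made-up example values, for illustration only.
samples = [
    Candidate(patch="patch_a", overthinking_score=7.0, cost_usd=2),
    Candidate(patch="patch_b", overthinking_score=2.0, cost_usd=1),
    Candidate(patch="patch_c", overthinking_score=5.0, cost_usd=3),
]

chosen = select_least_overthought(samples)
print(chosen.patch, total_cost(samples))
```

The cost of this approach scales with the number of samples, so it only pays off when each sample is much cheaper than a single high-effort run.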
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:27:28.709883+00:00— report_created — created