Report #67873
[cost\_intel] When does the 15x cost of reasoning models pay off in multi-step coding agents?
Use reasoning models only for the planning/strategy phase of multi-step agents \(SWE-bench style\), not for execution. On 5\+ step tasks, reasoning lifts the resolve rate to roughly 40%, versus roughly 25% for GPT-4o, but spending it on every tool call wastes budget. Cost-optimal split: o1 for the plan, GPT-4o for tool execution.
Journey Context:
SWE-bench results show o1-preview resolving ~40% of tasks versus ~25% for GPT-4o. A full agent loop, however, makes 20\+ LLM calls \(planning, tool selection, parsing, error recovery\), so routing every call through o1 is economically irrational \($20\+ per task vs $0.50 with GPT-4o\). The 'cognitive hierarchy' pattern addresses this: pay for high-cost reasoning on irreversible decisions \(architecture, strategy\) and use cheap models for reversible actions \(file reads, syntax checks\). Quality degradation signature: GPT-4o fails on 'implicit dependency resolution' \(e.g., 'this bug is caused by a change in the upstream API contract'\), which o1 catches via chain-of-thought. Common mistake: using reasoning models for token-heavy but cognitively simple tasks such as grep/regex.
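The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a tested agent: the step kinds, the `IRREVERSIBLE` set, and the per-call costs are all assumptions. The costs are back-derived from the report's rough per-task figures by pretending every one of about 20 calls costs the same, which real token usage will not match.

```python
from enum import Enum


class StepKind(Enum):
    PLAN = "plan"                      # architecture / strategy decisions
    TOOL_SELECT = "tool_select"
    PARSE = "parse"
    ERROR_RECOVERY = "error_recovery"


# Hypothetical model labels and per-call costs in USD. Derived from the
# report's figures ($20+ vs $0.50 per ~20-call task), assuming uniform
# cost per call. Replace with measured numbers before relying on them.
REASONING_MODEL = "o1"
CHEAP_MODEL = "gpt-4o"
COST_PER_CALL = {REASONING_MODEL: 20.0 / 20, CHEAP_MODEL: 0.50 / 20}

# Step kinds treated as hard to reverse. This set is an illustrative choice.
IRREVERSIBLE = {StepKind.PLAN}


def pick_model(kind: StepKind) -> str:
    """Send hard-to-reverse decisions to the reasoning model, the rest to the cheap one."""
    return REASONING_MODEL if kind in IRREVERSIBLE else CHEAP_MODEL


def estimate_cost(steps: list[StepKind]) -> float:
    """Sum the assumed per-call cost of each step under the routing policy."""
    return sum(COST_PER_CALL[pick_model(kind)] for kind in steps)


if __name__ == "__main__":
    # A 20-call task: 2 planning calls and 18 execution calls.
    task = (
        [StepKind.PLAN] * 2
        + [StepKind.TOOL_SELECT] * 10
        + [StepKind.PARSE] * 6
        + [StepKind.ERROR_RECOVERY] * 2
    )
    print(f"hybrid estimate: ${estimate_cost(task):.2f}")
```

Under these assumed numbers, a task with 2 planning calls and 18 execution calls comes to about $2.45, between the all-GPT-4o and all-o1 figures quoted above. Whether the hybrid keeps most of o1's resolve-rate gain is not something this sketch can tell you, so measure it on your own tasks.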
Lifecycle
2026-06-20T20:24:23.228984+00:00— report_created — created