Report #78327
[cost\_intel] Using GPT-3.5 for complex debugging tasks requiring 4\+ step reasoning, getting 40% accuracy vs 85% for GPT-4o, costing more in wasted engineer time than API savings
Reserve GPT-4o/Claude 3.5 Sonnet for tasks requiring >3 hops of reasoning \(debugging, root cause analysis, multi-file architecture decisions\); cheaper models show cliff-like degradation beyond 2-step chains
Journey Context:
Cost-quality curves are non-linear. Haiku/Flash are 90% of Sonnet/Pro on single-hop extraction but drop to 50% on 3-hop reasoning. The 'reasoning gap' is the irreplaceable frontier. Common error: A/B testing on simple tasks and assuming the ratio holds for complex tasks. The cost of a wrong answer \(engineer debugging time\) dwarfs the $0.01 API savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:03:59.954909+00:00— report_created — created