Report #41982
[cost_intel] Using GPT-4o for multi-step reasoning tasks where cheaper models compound errors
Reserve GPT-4o or Claude 3.5 Sonnet for tasks that need more than two sequential reasoning steps or tool use. On such tasks cheaper models compound errors, and the resulting retry loops make them more expensive overall.
Journey Context:
On single-step classification, Haiku matches Sonnet. But on "research this topic, then synthesize, then format" workflows, Haiku's per-step error rate of 5% compounds to roughly 14% over 3 steps (1 - 0.95^3 ≈ 0.143, assuming the steps fail independently). The recovery cost (retrying, human review, or downstream fixes) exceeds the savings. This is the automation frontier: below this complexity, use small models; above it, the cost of failure dominates. Specific signature: tasks requiring chain-of-thought with tool use (search -> calculate -> summarize). The quality degradation signature is not total failure but partial hallucination in intermediate steps that poisons the final output.
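The compounding arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a measured cost model: it assumes each step fails independently, and the function names and every cost figure passed to them are hypothetical placeholders, not real pricing.

```python
def compound_error_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` steps fails.

    Assumes steps fail independently (an assumption, not a measurement).
    """
    return 1.0 - (1.0 - per_step_error) ** steps


def expected_cost(run_cost: float, failure_rate: float, recovery_cost: float) -> float:
    """Expected cost of one task: the run itself plus failure-weighted recovery."""
    return run_cost + failure_rate * recovery_cost


def break_even_recovery_cost(
    small_cost: float, large_cost: float, small_fail: float, large_fail: float
) -> float:
    """Recovery cost above which the larger model is cheaper in expectation.

    Solves small_cost + small_fail * r == large_cost + large_fail * r for r.
    Requires small_fail > large_fail.
    """
    if small_fail <= large_fail:
        raise ValueError("small model must fail more often than the large one")
    return (large_cost - small_cost) / (small_fail - large_fail)


if __name__ == "__main__":
    # 5% per-step error over 3 steps, as in the report.
    rate = compound_error_rate(0.05, 3)
    print(f"3-step failure rate: {rate:.1%}")

    # Hypothetical unit costs, for illustration only.
    threshold = break_even_recovery_cost(
        small_cost=1.0, large_cost=3.0, small_fail=rate, large_fail=0.03
    )
    print(f"large model wins once recovery costs more than {threshold:.1f} units")
```

If the recovery cost per failed task (retry, human review, downstream fix) sits above the break-even value, the larger model is cheaper in expectation; below it, the small model still wins.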
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:56:25.034408+00:00— report_created — created