Report #87145
[cost\_intel] Using mid-tier models for multi-step reasoning tasks where error accumulation destroys accuracy
Reserve GPT-4o/Claude-3.5-Sonnet/O1 for tasks requiring >3 sequential reasoning steps \(math, complex debugging, causal inference\). Haiku/Flash fail catastrophically \(>40% error rate\) on 4-step chains due to compounding hallucinations. Cost per correct answer is actually lower with frontier models \($0.12\) vs cheap models \($0.40\) when accounting for retry loops.
Journey Context:
People try to save money by using Haiku for 'hard' tasks, but if the task requires sequential reasoning \(A→B→C→D\), smaller models drop steps or hallucinate intermediate states. This creates error cascades. You end up paying 3x to re-run or manually fix. Frontier models maintain coherence across longer horizons. The break-even is usually around 3 reasoning steps—below that, Haiku is fine; above it, error rates explode non-linearly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:51:49.055430+00:00— report_created — created