Report #57753
[cost\_intel] Using Haiku or Flash for multi-step reasoning tasks with 3 or more dependent steps
Use Sonnet/Pro/GPT-4 class models for tasks requiring 3 or more chained reasoning steps where each step depends on prior output. Smaller models show lower per-step accuracy (often 75-85% versus 90-95% for the larger models), and the loss compounds multiplicatively across steps.
Journey Context:
Smaller models handle single-step reasoning well but exhibit compounding error on multi-step chains. If step errors are independent, a 3-step task where each step is 90% accurate yields 73% end-to-end accuracy (0.9^3 ≈ 0.73). For Haiku/Flash, per-step reasoning accuracy is often 75-85% versus 90-95% for Sonnet/Pro, meaning a 3-step chain drops to 42-61% end-to-end accuracy versus 73-86% for the larger models. The degradation signature: outputs are locally coherent per step but globally inconsistent. For example, step 1 identifies a bug in function A, step 2 proposes a fix modifying function B, and step 3 writes a test that does not cover the original bug. This is the task category where frontier models are genuinely irreplaceable. Cost reality check: Sonnet is 12x more expensive than Haiku per token, so 3 Haiku attempts at a 3-step task still cost about a quarter of 1 Sonnet attempt in raw tokens, and token price alone favors Haiku until it needs more than 12 attempts per correct result. The case for Sonnet therefore rests on correctness rather than token price: a globally inconsistent chain has to be detected before it can be retried, and that checking is not free.
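The arithmetic above can be reproduced with a minimal sketch. It makes several assumptions that are not part of the report: steps succeed independently, every attempt consumes the same number of tokens, a failed attempt is detected at no cost and simply retried, and prices are in relative units (small model = 1, large model = 12, using the 12x ratio quoted above). The function names are hypothetical.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """End-to-end accuracy of a chain, assuming each step succeeds
    independently with probability per_step."""
    return per_step ** steps


def cost_per_correct(price_per_attempt: float, success_rate: float) -> float:
    """Expected spend per correct result when retrying until success.
    The expected number of attempts is 1 / success_rate."""
    return price_per_attempt / success_rate


STEPS = 3
SMALL_PRICE = 1.0   # relative units (assumption)
LARGE_PRICE = 12.0  # 12x ratio taken from the report

# Per-step accuracy ranges quoted in the report.
for label, price, per_step in [
    ("small, low end", SMALL_PRICE, 0.75),
    ("small, high end", SMALL_PRICE, 0.85),
    ("large, low end", LARGE_PRICE, 0.90),
    ("large, high end", LARGE_PRICE, 0.95),
]:
    acc = chain_accuracy(per_step, STEPS)
    cost = cost_per_correct(price, acc)
    print(f"{label:16s} end-to-end {acc:.0%}  cost per correct result {cost:.2f}")
```

Under these assumptions the small model stays cheaper per correct result on token price alone. The sketch leaves out the cost of noticing that a chain went wrong, which is the part that depends on the workload.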
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:25:43.724805+00:00— report_created — created