Report #77242
[cost\_intel] Using small models for multi-step reasoning tasks where quality falls off a cliff after 2-3 reasoning steps
Use frontier models \(Sonnet, GPT-4o, Opus\) for tasks requiring 3\+ chained reasoning steps; small models compound errors catastrophically past step 2-3, making retries more expensive than using the right model upfront
Journey Context:
Small models handle single-step tasks well but degrade sharply on multi-step chains because errors compound: each step builds on the previous one, so an early mistake propagates. An illustrative 4-step reasoning task: a frontier model reaches about 90% end-to-end accuracy, while a small model's cumulative accuracy falls from 85% after step 1 to 70% after step 2, 50% after step 3, and 25% after step 4. The cost trap: developers try to compensate by retrying failed small-model outputs, but retries do not fix systematic reasoning failures; they just burn tokens. The signature of the reasoning cliff is that small-model outputs become increasingly confident and specific while being wrong. Hallucination rate spikes in later reasoning steps as the model confabulates to maintain coherence. At 10-30x the per-token cost, frontier models are often cheaper in effective terms once retry loops and downstream error handling are accounted for.
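The cost argument above can be checked with a small calculation. The sketch below is illustrative only and rests on assumptions that are not in the report: retries are treated as independent draws (optimistic for a small model whose failures are systematic), the 25% step-4 figure is read as the small model's end-to-end accuracy, the price ratio is set to 20x (inside the quoted 10-30x range), and `error_cost` is a hypothetical per-failure charge for detection and downstream handling. All function names are made up for this example.

```python
from math import prod


def chain_accuracy(per_step_accuracy):
    """End-to-end accuracy of a chain in which every step must be correct."""
    return prod(per_step_accuracy)


def cost_per_correct_result(attempt_cost, accuracy, error_cost):
    """Expected spend to obtain one correct result.

    Assumes independent retries until success (geometric distribution),
    so expected attempts = 1 / accuracy. error_cost is a hypothetical
    charge for each wrong output (detection, cleanup, downstream handling).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    expected_attempts = 1.0 / accuracy
    expected_failures = expected_attempts - 1.0
    return attempt_cost * expected_attempts + error_cost * expected_failures


def break_even_error_cost(small_cost, small_acc, frontier_cost, frontier_acc):
    """Per-failure cost above which the frontier model is cheaper.

    Solves cost_per_correct_result(small) == cost_per_correct_result(frontier)
    for error_cost. A negative result means the frontier model is cheaper
    even when failures cost nothing.
    """
    slope_gap = (1.0 / small_acc) - (1.0 / frontier_acc)
    if slope_gap <= 0:
        raise ValueError("small model must be less accurate than the frontier model")
    return (frontier_cost / frontier_acc - small_cost / small_acc) / slope_gap


if __name__ == "__main__":
    # Even 90% per-step accuracy compounds noticeably over four dependent steps.
    print(f"4 steps at 90% each: {chain_accuracy([0.9] * 4):.2%}")

    # Assumed inputs: end-to-end accuracies from the example above,
    # costs in units of one small-model call (20x ratio is an assumption).
    small_cost, small_acc = 1.0, 0.25
    frontier_cost, frontier_acc = 20.0, 0.90

    threshold = break_even_error_cost(small_cost, small_acc, frontier_cost, frontier_acc)
    print(f"Frontier is cheaper once each failure costs more than {threshold:.2f} small-model calls")
```

Under these assumptions the comparison hinges on how expensive each bad output is: with free failures the small model wins on raw token spend, and the balance flips once handling a wrong answer costs more than the break-even value. Systematic (non-independent) failures would push the result further toward the frontier model, since retries would then succeed less often than this model predicts.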
Lifecycle
2026-06-21T12:15:00.982415+00:00— report_created — created