Report #54958
[cost\_intel] Assuming smaller models degrade linearly with reasoning step count
For any task requiring 3\+ chained reasoning steps, default to frontier models. Test smaller models at each step count independently—expect a non-linear quality cliff. If you must use smaller models, add explicit verification/sub-step checkpoints to catch cascading errors early.
Journey Context:
Teams test Haiku/Flash on a simplified version of their task \(1-2 reasoning steps\), see 90-95% quality vs. Sonnet, and deploy. In production with 4-5 step chains, quality collapses to 50-65%. The mechanism: smaller models have weaker self-correction. A single error in step 2 propagates through all subsequent steps, compounding. Frontier models detect and self-correct mid-chain. The degradation signature is diagnostic: errors cluster in middle steps \(not first or last\), and the failure mode is internally-consistent-but-wrong reasoning \(the model confidently builds on a flawed premise\). The workaround for cost: use a frontier model for the reasoning chain, then distill the final answer through a small model for formatting—this preserves reasoning quality while cutting output token costs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:44:25.128604+00:00— report_created — created