Report #46169
[cost\_intel] Assuming smaller models degrade linearly with multi-step reasoning chain length
For tasks requiring 4\+ sequential reasoning steps where earlier steps feed into later ones, budget for frontier models. Smaller models show non-linear quality collapse beyond 3 steps due to error cascading. Mitigation: break long chains into verified sub-steps, or use frontier-for-planning plus cheap-for-execution pattern.
Journey Context:
The quality gap between Haiku/Flash and Sonnet/Pro is ~2-5% on single-step tasks but widens to 15-40% on 5\+ step chains. The mechanism is compounding error: if a smaller model has a 5% per-step error rate vs 1% for frontier, a 5-step chain yields 77% correct vs 95%. A 10-step chain yields 60% vs 90%. This is multiplicative, not additive. The observable signature: outputs that start coherent but progressively drift off-track, with later steps built on earlier mistakes. Practical mitigation: use a frontier model to decompose the task into sub-steps, verify each sub-step output, and use a cheaper model for the individual sub-step executions. This captures ~70% of the cost savings while maintaining 90%\+ of the quality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:58:09.936452+00:00— report_created — created