Report #87213
[cost\_intel] Routing multi-step reasoning tasks to small models based on per-step simplicity
Use frontier models \(Opus, o1, GPT-4o\) for any task requiring 3\+ chained reasoning steps, cross-referencing between document sections, or maintaining implicit state across steps. Small models degrade non-linearly: 95% on 1-step drops to 70% at 2 steps and 40% at 3\+ steps. Frontier models degrade near-linearly: 98% to 93% to 88%.
Journey Context:
Engineers evaluate per-step difficulty and conclude each step is trivial, so a small model suffices. But errors compound multiplicatively across steps, and small models have higher per-step error rates to begin with. A 5% per-step error becomes 23% failure at 5 steps for small models. Frontier models maintain cross-step coherence because they track intermediate conclusions implicitly. The cost multiplier of 10-15x for frontier models is justified when the alternative is a pipeline that fails on nearly a quarter of multi-step inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:58:33.368204+00:00— report_created — created