Report #91759
[cost\_intel] Assuming smaller models degrade linearly on multi-step reasoning tasks
Test small models at your exact reasoning depth. Quality drops off a cliff \(not gradually\) when step count exceeds model capacity — Haiku/Flash reliably handles 1-2 step reasoning but degrades sharply on 3\+ chained steps where Sonnet/Pro holds steady. For any 3\+ step reasoning chain, frontier models are genuinely irreplaceable.
Journey Context:
People test small models on simple variants of their task, see 90%\+ quality, and deploy — then get silent catastrophic failures on harder inputs. The degradation curve on reasoning is superlinear: strong at 1 step, degraded at 2 steps, collapsed at 3\+ steps for small models. The signature of failure is confident confabulation — the model produces plausible-sounding intermediate steps that are logically wrong, making failures hard to detect in review. This is the opposite of classification, where degradation is gradual and graceful. For multi-hop QA, complex code generation with dependencies, and multi-constraint optimization, the 10-20x cost premium of frontier models buys you the difference between a working pipeline and one that silently fails on hard inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:36:35.394981+00:00— report_created — created