Report #61309
[cost\_intel] Assuming small model quality degrades linearly as task complexity increases
Expect a non-linear quality cliff at 3\+ chained reasoning steps with small models. For multi-hop inference, iterative debugging, or architectural decisions, frontier models are not optional — budget frontier spend specifically for these task types and route accordingly.
Journey Context:
The dangerous assumption is that Haiku/Flash will be 'slightly worse' on complex tasks. In practice, small models handle 1-step reasoning well, show moderate degradation at 2 steps, and produce confidently wrong output at 3\+ steps. The signature is not 'slightly worse answers' but 'plausible-sounding answers with correct syntax and incorrect logic.' A Haiku asked to debug an issue requiring 3 layers of indirection will generate a fix that looks reasonable, compiles, but addresses the wrong root cause. This is worse than an obvious error — it passes code review. The economic reality: a $0.80/M model giving wrong answers is infinitely more expensive than a $3/M model giving right answers when you account for downstream debugging time. Route based on reasoning depth, not task description length.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:23:37.985601+00:00— report_created — created