Report #56941
[cost_intel] Assuming smaller models degrade gradually on multi-step reasoning tasks
For tasks requiring 3+ chained reasoning steps, planning, or backtracking, budget for frontier models. The quality drop is a cliff, not a slope: smaller models produce confidently wrong outputs, which are harder to catch than hedging or refusals.
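A minimal sketch of how this budgeting rule might be encoded as a routing check. The names `TaskProfile`, `choose_model_tier`, and the tier labels are hypothetical, and the step count is assumed to come from your own task analysis; only the 3-step threshold comes from this report.

```python
from dataclasses import dataclass

# Tier labels are placeholders, not real model names.
FRONTIER = "frontier"
SMALL = "small"

# Threshold taken from the report: 3+ chained reasoning steps.
MULTI_STEP_THRESHOLD = 3


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a task, filled in by the caller."""
    chained_steps: int
    needs_planning: bool = False
    needs_backtracking: bool = False


def choose_model_tier(task: TaskProfile) -> str:
    """Route to the frontier tier when any cliff condition applies."""
    if (
        task.chained_steps >= MULTI_STEP_THRESHOLD
        or task.needs_planning
        or task.needs_backtracking
    ):
        return FRONTIER
    return SMALL
```

For example, `choose_model_tier(TaskProfile(chained_steps=2))` selects the small tier, while a 2-step task that needs backtracking is still routed to the frontier tier.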
Journey Context:
Developers test small models on simple reasoning cases and extrapolate a linear degradation curve. The actual pattern: small models stay within 10-15% of frontier quality on 1-2 step reasoning, but at 3+ steps they do not degrade gracefully. Instead, they hallucinate plausible-but-incorrect chains with high confidence. This failure mode is dangerous because it evades simple confidence thresholds: the signature is a definitive wrong answer, not hedging. If your pipeline can't verify each reasoning step independently, don't trust small models on multi-step tasks.
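To illustrate why a confidence threshold misses this failure while per-step verification catches it, here is a toy sketch. It assumes each reasoning step can be reduced to an arithmetic claim that is cheap to recompute; the `Step` structure, its `confidence` field, and the sample chain are all hypothetical and not taken from any real model output.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Sequence

# Operations the toy verifier knows how to recompute.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    """One claimed reasoning step: left <op> right = claimed."""
    left: int
    op: str
    right: int
    claimed: int
    confidence: float  # hypothetical self-reported confidence


def passes_confidence_threshold(chain: Sequence[Step], threshold: float = 0.9) -> bool:
    """Naive gate: accept the chain if every step reports high confidence."""
    return all(step.confidence >= threshold for step in chain)


def first_bad_step(chain: Sequence[Step]) -> Optional[int]:
    """Recompute each step independently; return the 0-based index of the
    first step whose claimed result is wrong, or None if all steps check out."""
    for index, step in enumerate(chain):
        if OPS[step.op](step.left, step.right) != step.claimed:
            return index
    return None


# Illustrative chain: the third step is stated confidently but is wrong
# (57 - 20 is 37, not 47).
chain = [
    Step(12, "*", 4, 48, confidence=0.97),
    Step(48, "+", 9, 57, confidence=0.96),
    Step(57, "-", 20, 47, confidence=0.98),
]
```

Here `passes_confidence_threshold(chain)` accepts the chain, while `first_bad_step(chain)` flags index 2. Real reasoning steps are rarely this easy to recompute, which is the point of the guidance above: without an independent check per step, the confident wrong answer passes through.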
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:03:51.271140+00:00— report_created — created