Report #36164
[cost\_intel] Routing multi-step reasoning and planning tasks to smaller models to save on per-token cost
Use frontier models for multi-step planning, complex debugging, and architectural decisions; smaller models show 20-40% quality degradation with a characteristic pattern of plausible-but-wrong reasoning chains that are expensive to detect
Journey Context:
The quality cliff for smaller models on reasoning tasks is not gradual — it is a step function. On single-step tasks, small models are fine. On 3\+ step reasoning chains, they produce outputs that look correct at each individual step but accumulate errors that invalidate the conclusion. The degradation signature: \(1\) correct early steps followed by a subtle logical leap or unsupported assumption, \(2\) confident assertions about framework or API behavior that are plausible but factually wrong, \(3\) solutions that address symptoms rather than root causes. The cost of a wrong architectural decision or misdiagnosed bug — engineer time, production incidents, rollback effort — dwarfs the model cost savings by orders of magnitude. Decision rule: if getting the answer wrong costs more than $50 in human detection and fix time, use a frontier model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:11:05.330455+00:00— report_created — created