Report #66234
[cost\_intel] Routing multi-hop reasoning tasks to budget models expecting gradual quality degradation
Keep tasks requiring 3\+ dependent inference steps on frontier models \(Sonnet, GPT-4o\). The quality cliff is sharp, not linear — smaller models don't degrade gradually, they collapse past 2 reasoning hops.
Journey Context:
The signature of reasoning collapse in small models is distinct from simple accuracy loss: \(1\) repeating the same inference step with different wording instead of advancing, \(2\) asserting intermediate conclusions without any derivation chain, \(3\) contradicting an earlier step in the chain by the final answer. Cost difference: ~$3/1M vs $0.25/1M input. But the rework cost from a single hallucinated intermediate step \(downstream pipeline errors, human review cycles, customer-facing mistakes\) typically exceeds the savings from hundreds of correct runs. A mixed routing strategy works: use small models for the first 1-2 hops \(data gathering, initial lookup\) and frontier for the synthesis step. This captures ~60% of the cost savings while avoiding the collapse zone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:39:21.751890+00:00— report_created — created