Report #94014
[cost\_intel] Model routing between small and large models adds too much latency and complexity — just pick one model
Implement two-tier confidence routing: send each request to a small model first, score the output's confidence using token logprobs, self-consistency checks, or a lightweight validator, and escalate only low-confidence outputs to a frontier model. For classification and extraction tasks, this routes 70-85% of traffic to the cheap model, reducing costs 5-8x while maintaining 95%+ of frontier-model quality. Tune the escalation threshold to hit your quality target.
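A minimal sketch of the routing conditional, assuming each model is wrapped in a callable that returns its text plus per-token log probabilities. The names `ModelOutput`, `mean_token_confidence`, and `route`, and the default threshold of 0.9, are illustrative assumptions and not part of any particular SDK. The confidence score here is the geometric-mean token probability, which is only one of the options named above.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class ModelOutput:
    """Hypothetical wrapper around a model response."""
    text: str
    token_logprobs: Sequence[float]  # one natural-log probability per output token


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability in [0, 1]; 0.0 if there are no tokens."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(
    prompt: str,
    small_model: Callable[[str], ModelOutput],
    large_model: Callable[[str], ModelOutput],
    threshold: float = 0.9,  # assumed starting point; tune on your own data
) -> Tuple[str, str]:
    """Try the small model first and escalate when its confidence is below threshold."""
    draft = small_model(prompt)
    if mean_token_confidence(draft.token_logprobs) >= threshold:
        return draft.text, "small"
    # Low confidence: discard the draft and pay for the frontier model.
    return large_model(prompt).text, "large"


if __name__ == "__main__":
    # Stub models standing in for real API calls.
    def small(prompt: str) -> ModelOutput:
        return ModelOutput("positive", [-0.01, -0.03])

    def large(prompt: str) -> ModelOutput:
        return ModelOutput("positive", [-0.001])

    print(route("Classify: 'great product'", small, large))
```

For short classification labels, the logprob of the label token alone may be a better signal than the mean over all tokens; which score works best is something to check against labeled samples.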
Journey Context:
The objection to routing is latency and engineering complexity. In practice the average latency impact is neutral or positive: the small model answers easy cases faster than a frontier model would, and only escalated cases pay extra, namely the small-model pass and the confidence check on top of the usual frontier call. The engineering complexity is bounded: at its core, routing is a conditional on a confidence metric. The core insight is that task difficulty in production roughly follows a power law. About 80% of requests are easy variants (clear inputs, unambiguous labels) that any model handles, while about 20% are genuinely hard (ambiguous, adversarial, out-of-distribution). A single-model approach over-provisions for the easy 80%. The main failure mode is a miscalibrated threshold: escalating too much erases the savings, and escalating too little lets bad outputs through. Start conservative at 30% escalation and lower the rate as you validate quality on escalated samples. The FrugalGPT cascading pattern demonstrated that this approach can cut cost by up to 90% with quality matching the best single model.
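The threshold advice above can be sketched as two small helpers: one picks a confidence cutoff that escalates a target share of a validation set, and one estimates the resulting savings. Both function names and the cost model are assumptions for illustration. The cost model assumes every request pays for the small model, escalated requests also pay for the frontier model, and token counts are equal across tiers.

```python
import math
from typing import Sequence


def threshold_for_escalation_rate(confidences: Sequence[float], rate: float) -> float:
    """Return a cutoff such that about `rate` of the samples score strictly below it.

    Samples with confidence < cutoff are escalated; ties can make the
    realized rate slightly lower than requested.
    """
    if not confidences:
        raise ValueError("need at least one confidence score")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    ordered = sorted(confidences)
    k = round(rate * len(ordered))  # number of samples to escalate
    if k >= len(ordered):
        return math.inf  # escalate everything
    return ordered[k]


def savings_factor(escalation_rate: float, small_cost: float, large_cost: float) -> float:
    """Cost of always using the large model divided by the blended routed cost."""
    blended = small_cost + escalation_rate * large_cost
    return large_cost / blended


if __name__ == "__main__":
    # Hypothetical validation scores from the small model.
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    cutoff = threshold_for_escalation_rate(scores, rate=0.3)
    print("cutoff:", cutoff)
    # Hypothetical prices: the frontier model costs 20x the small one per request.
    print("savings:", savings_factor(0.2, small_cost=1.0, large_cost=20.0))
```

Under this cost model the savings factor can never exceed `1 / escalation_rate`, however cheap the small model is, so the escalation rate is the main lever on cost. Lowering it from the conservative 30% starting point is where most of the gain comes from.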
Lifecycle
2026-06-22T16:23:16.435945+00:00— report_created — created