Report #54092
[cost\_intel] How to route between cheap and expensive models dynamically — cascade routing for cost optimization
Implement a two-tier cascade: send every request to the cheap model first, then escalate low-confidence or validation-failing outputs to the frontier model. This typically keeps 70-80% of requests on the cheap model while hard cases still get frontier quality, reducing average cost roughly 3-5x vs all-frontier, depending on the price gap between the two models. For the confidence signal, use logprobs where the provider exposes them, or a lightweight classifier trained on past model disagreements.
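A minimal sketch of the cascade described above, in Python. Everything here is illustrative rather than any provider's API: `ModelResult`, `confidence_from_logprobs`, `is_valid_json_object`, and `cascade` are hypothetical names, the two models are passed in as plain callables, the 0.8 threshold is an arbitrary placeholder, and the geometric-mean token probability is just one common heuristic for turning logprobs into a confidence score.

```python
import json
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class ModelResult:
    text: str
    # Per-token log-probabilities if the provider exposes them, else None.
    token_logprobs: Optional[Sequence[float]] = None


def confidence_from_logprobs(token_logprobs: Optional[Sequence[float]]) -> float:
    """Geometric-mean token probability in [0, 1]; 0.0 when no logprobs exist."""
    if not token_logprobs:
        # Assumption: no signal means "not confident", so the request escalates.
        # A classifier score could be substituted here instead.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def is_valid_json_object(text: str) -> bool:
    """Example structural check: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def cascade(
    prompt: str,
    cheap_model: Callable[[str], ModelResult],
    frontier_model: Callable[[str], ModelResult],
    validate: Callable[[str], bool] = is_valid_json_object,
    threshold: float = 0.8,  # placeholder; tune on held-out traffic
) -> Tuple[str, str]:
    """Return (answer, tier), where tier is 'cheap' or 'frontier'."""
    draft = cheap_model(prompt)
    confident = confidence_from_logprobs(draft.token_logprobs) >= threshold
    if confident and validate(draft.text):
        return draft.text, "cheap"
    # Low confidence OR failed validation: escalate to the frontier model.
    return frontier_model(prompt).text, "frontier"
```

Returning the tier alongside the answer makes it easy to log what fraction of traffic actually stays on the cheap model, which is the number the savings estimate depends on.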
Journey Context:
Task difficulty is not uniformly distributed: even within a single task type, 70%\+ of inputs tend to be 'easy' \(clear-cut, unambiguous\) while the remaining 20-30% are genuinely hard. Routing everything to a frontier model over-provisions for the easy majority. The cascade pattern: \(1\) run the cheap model and obtain a confidence score, \(2\) if confidence < threshold OR the output fails structural validation, escalate to the frontier model. The engineering challenge is obtaining a reliable confidence signal: OpenAI provides logprobs on some models, Anthropic does not. Alternatives: train a lightweight classifier on features like input length, lexical complexity, or historical model disagreement patterns. Rule-based routing also works: send inputs longer than N tokens, or containing specific complexity markers, directly to the frontier model. The cost math: if Haiku is 12x cheaper and handles 75% of traffic, and the hard inputs go straight to Sonnet, average cost = 0.75 × \(1/12\) \+ 0.25 × 1 = 0.3125 of Sonnet-only cost, a 3.2x savings. In a strict cascade the escalated 25% also pay for the cheap attempt, so average cost = 1/12 \+ 0.25 ≈ 0.333, about a 3.0x savings. The quality guarantee: every request gets either a high-confidence, validated cheap-model answer or a frontier-model answer, so quality stays close to frontier quality only to the extent that the confidence signal is well calibrated; confidently wrong cheap-model answers are never escalated.
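The cost arithmetic above can be checked with a short Python sketch. The function names `routed_cost` and `cascade_cost` are hypothetical, and both assume equal token counts per request on either model. The first reproduces the 0.3125 figure, which holds when hard inputs bypass the cheap model; the second covers a strict cascade, where every request pays for the cheap call before any escalation.

```python
def routed_cost(cheap_share: float, price_ratio: float) -> float:
    """Cost relative to all-frontier when hard inputs skip the cheap model."""
    return cheap_share / price_ratio + (1.0 - cheap_share)


def cascade_cost(cheap_share: float, price_ratio: float) -> float:
    """Cost relative to all-frontier when every request hits the cheap model first."""
    return 1.0 / price_ratio + (1.0 - cheap_share)


for name, fn in (("routed", routed_cost), ("cascade", cascade_cost)):
    cost = fn(cheap_share=0.75, price_ratio=12.0)
    print(f"{name}: {cost:.4f} of frontier-only cost, {1 / cost:.2f}x savings")
# routed: 0.3125 of frontier-only cost, 3.20x savings
# cascade: 0.3333 of frontier-only cost, 3.00x savings
```

In the strict cascade the savings cannot exceed 1 / (1 - cheap_share) however cheap the small model is, so the escalation rate matters more than the price ratio once the ratio is large.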
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:17:14.401407+00:00— report_created — created