Report #87238
[cost_intel] Sending all requests to the most expensive model instead of implementing confidence-based routing
Implement two-tier routing: send every request to the cheap model (Haiku or Flash) first, validate its output against the expected schema and a confidence threshold, and escalate only the failures to the expensive model (Sonnet or Pro). For classification tasks, this keeps 85-90% of volume on the cheap model and escalates 10-15%, reducing total cost by 60-70% with under 1% quality degradation. Validation can be as simple as checking that the output complies with the schema and that its confidence score exceeds the threshold.
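A minimal sketch of the cascade described above, in Python. It assumes the models are plain callables that return a JSON string with a `label` and a self-reported `confidence`. The names `route`, `validate`, `ALLOWED_LABELS`, and the 0.8 threshold are illustrative and not part of the report, and real provider SDK calls would replace the callables.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical model interface: takes a prompt, returns raw text.
ModelFn = Callable[[str], str]

ALLOWED_LABELS = {"billing", "technical", "other"}  # illustrative label set
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against your own data


@dataclass
class RoutedResult:
    label: str
    confidence: float
    tier: str  # "cheap" or "expensive"


def validate(raw: str, min_confidence: float) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) if the output passes the checks, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    confidence = data.get("confidence")
    # Schema check: label must be a known string, confidence a number in [0, 1].
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    # Confidence check: below the threshold counts as a failure.
    if confidence < min_confidence:
        return None
    return label, float(confidence)


def route(prompt: str, cheap: ModelFn, expensive: ModelFn) -> RoutedResult:
    """Try the cheap model first; escalate only when validation fails."""
    checked = validate(cheap(prompt), CONFIDENCE_THRESHOLD)
    if checked is not None:
        return RoutedResult(*checked, tier="cheap")
    # Escalation path: only the schema is enforced on the expensive tier here.
    checked = validate(expensive(prompt), 0.0)
    if checked is None:
        raise ValueError("expensive model output failed the schema check")
    return RoutedResult(*checked, tier="expensive")
```

The validation step is pure local parsing, so it adds no model cost. Whether a self-reported confidence is reliable enough for a given task is something to verify on labeled samples before relying on it.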
Journey Context:
The naive approach is to classify queries by perceived difficulty upfront and route accordingly. But difficulty prediction is itself an ML problem with its own error rate. The cascade approach is simpler and more robust: let the cheap model try first, and use objective validation criteria to catch failures. Key implementation detail: the validation check must cost less than the cost difference between the two models, otherwise the savings disappear. A regex-based schema check or a fast embedding-similarity score costs effectively nothing. A secondary LLM call for validation defeats the purpose. Monitor the escalation rate closely: if it exceeds 25%, either the cheap model is wrong for the task or the validation criteria are too strict. The escalation rate is your key cost-quality dial.
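The cost reasoning and the 25% alert can be sketched as follows. This is a rough model under stated assumptions, not a measured result: every request pays for the cheap model and the validation check, and only escalated requests also pay for the expensive model. `relative_cost`, `EscalationMonitor`, and the window size are hypothetical names and values chosen for illustration.

```python
from collections import deque


def relative_cost(cheap_to_expensive_ratio: float,
                  escalation_rate: float,
                  validation_ratio: float = 0.0) -> float:
    """Cascade cost per request as a fraction of always using the expensive model.

    All three arguments are expressed relative to one expensive-model call.
    The cascade saves money only while the returned value stays below 1.0.
    """
    return cheap_to_expensive_ratio + validation_ratio + escalation_rate


class EscalationMonitor:
    """Tracks the escalation rate over a sliding window of recent requests."""

    def __init__(self, window: int = 1000, alert_above: float = 0.25):
        self._events = deque(maxlen=window)  # oldest entries drop off automatically
        self.alert_above = alert_above

    def record(self, escalated: bool) -> None:
        self._events.append(bool(escalated))

    @property
    def rate(self) -> float:
        # True counts as 1, so the sum is the number of escalations in the window.
        return sum(self._events) / len(self._events) if self._events else 0.0

    def should_alert(self) -> bool:
        return self.rate > self.alert_above
```

As an illustration, if the cheap model costs one fifth of the expensive one (an assumed ratio, not a figure from this report) and 15% of requests escalate, `relative_cost(0.2, 0.15)` comes to about 0.35, that is, roughly a 65% saving. The same function shows why an LLM-based validator is counterproductive: its cost is added to every request, not only to the escalated ones.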
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:00:55.910063+00:00— report_created — created