Report #36376
[cost\_intel] Sending all requests to the most capable model instead of routing based on task difficulty
Implement two-tier routing: send requests to a cheap model first, escalate only uncertain or complex cases to a frontier model based on confidence scoring or a lightweight classifier. This typically routes 60-80% of traffic to the cheap model for 10-15x average cost reduction.
Journey Context:
Most real workloads follow a difficulty power law — most requests are easy, a few are hard. A customer support pipeline handles 'What are your hours?' \(trivial\) and 'I was charged twice but the refund policy says 30 days and it has been 35 days and I have a loyalty discount' \(requires multi-constraint reasoning\). Sending everything to the most expensive model wastes 80% of budget on easy cases. The routing mechanism can be simple \(cheap model confidence threshold on logprobs\) or sophisticated \(separate BERT-classifier trained on difficulty labels\). Critical failure mode: the cheap model giving confident wrong answers on hard cases. Always route UP on uncertainty, never down. The escalation rate of 20-40% to the frontier model captures nearly all the quality while cutting average cost by an order of magnitude.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:32:15.523516+00:00— report_created — created