Report #29593
[cost\_intel] Using a single model for all query complexities instead of routing by task difficulty
Implement a two-tier router: classify queries as simple or complex \(rule-based on task type, or a small classifier\), then route simple queries to Haiku/Flash and complex queries to Sonnet/GPT-4o. This typically reduces costs by 60-80% with <3% quality degradation.
Journey Context:
Most production workloads follow a Pareto distribution: 70-80% of queries are simple \(lookup, format, classify\) and 20-30% are complex \(reason, synthesize, debug\). Sending everything to the frontier model means you're over-spending on the easy majority. The RouteLLM paper demonstrated that a trained router can maintain 90% of GPT-4 quality while only invoking it on 20% of queries. Even a simple rule-based router \(e.g., 'if task is extraction or classification, use small model; if task is code generation or multi-step reasoning, use frontier'\) captures most of the savings. The risk is misrouting complex queries to the small model, which produces bad outputs. Mitigate with: \(1\) confidence thresholds on the router, \(2\) fallback to frontier model on low-confidence small-model outputs, \(3\) monitoring quality metrics per route.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:03:48.473209+00:00— report_created — created