Report #28808
[cost\_intel] Using a single model for all requests instead of routing based on task complexity
Implement a two-tier routing pattern: use a cheap model \(Haiku/GPT-4o-mini\) as the default, and escalate to a frontier model only when the cheap model signals low confidence or the task is pre-classified as complex. This typically routes 70-80% of requests to the cheap model while maintaining 95%\+ overall quality.
Journey Context:
Request difficulty follows a power law — most requests are easy, few are hard. A simple classifier \(or the cheap model's own logprobs/uncertainty\) can identify which requests need frontier quality. The ROI: if 75% of requests go to a model costing 1/20th the price, average cost per request drops ~18x. The risk is misrouting hard requests to the cheap model, causing quality failures. Mitigation: route conservatively \(when in doubt, escalate\), monitor quality on a sample of cheap-model outputs, and use the cheap model's own confidence as the routing signal — if it struggles, that is itself diagnostic. Frameworks like LiteLLM implement this pattern with fallback routing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:44:50.013744+00:00— report_created — created