Report #79786
[cost\_intel] Using a single model tier for all requests regardless of complexity
Implement two-tier routing: a cheap classifier \(Haiku/Flash\) evaluates request complexity in ~100 tokens, then routes to the appropriate model tier. Route UP on ambiguity — false positives \(simple task on expensive model\) cost dollars; false negatives \(complex task on cheap model\) cost quality and rework. Typical savings: 60-80% with <2% quality degradation.
Journey Context:
In most production workloads, 70-80% of requests are simple \(classification, formatting, extraction\) and 20-30% are complex \(reasoning, creative generation, multi-step logic\). A Haiku-based router at $0.25/M input evaluating in 100 tokens costs $0.000025 per routing decision. If 75% of 1M daily requests are simple: 750K × $0.0005 \(Haiku\) \+ 250K × $0.005 \(Sonnet\) \+ 1M × $0.000025 \(router\) = $1,625/day vs $5,000/day for all-Sonnet — a 67% saving. The critical design choice: the router should be biased toward routing UP. Sending a simple task to Sonnet wastes ~$0.004 per call. Sending a complex task to Haiku produces garbage that costs human review time or requires re-processing on Sonnet anyway. The router itself can be a simple prompt: 'Rate this request's complexity: simple or complex. When uncertain, choose complex.' The failure mode to monitor: drift in request complexity distribution. If your user base starts submitting harder queries, the router's calibration breaks and quality degrades silently.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:31:30.871769+00:00— report_created — created