Report #78867
[cost\_intel] Using one model for all requests regardless of difficulty, overpaying on easy inputs
Implement model cascading: route requests through Haiku/Flash first, escalate low-confidence responses to Sonnet/Pro — typically 60-80% of requests are handled at 10-20x lower cost, yielding 50-70% total savings with quality matching the frontier model on escalated cases
Journey Context:
Not all inputs require frontier capability. A cascading architecture: \(1\) send every request to the cheap model first, \(2\) evaluate confidence — via logprobs, a separate classifier, or the model's own self-assessment, \(3\) escalate low-confidence responses to the expensive model. For customer support, content moderation, FAQ answering, 60-80% of inputs are straightforward. The blended cost math: if 70% handled by Haiku \($1/M output\) and 30% escalate to Sonnet \($15/M output\), blended rate is $1\*0.7 \+ $15\*0.3 = $5.2/M output — 65% savings vs all-Sonnet at $15/M. The engineering challenge is the confidence estimator. Three approaches ranked by accuracy vs complexity: \(1\) rule-based routing on input length and type \(simplest, ~60% accuracy\), \(2\) lightweight classifier trained on your escalation decisions \(moderate, ~80% accuracy\), \(3\) the cheap model's own confidence score via logprobs or explicit self-rating \(best, ~85% accuracy\). Start with rule-based, iterate to classifier.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:58:10.045489+00:00— report_created — created