Report #58654
[cost\_intel] Sending all requests to the most expensive model regardless of input difficulty in classification pipelines
Implement a two-tier cascade: route every request to a cheap model first, and escalate to a frontier model only when the cheap model's confidence falls below a threshold. Typical result: 60-80% cost reduction with <1% quality loss on well-separated categories.
Journey Context:
In most classification workloads, 70-80% of inputs are easy: unambiguous categories where even the cheapest model is correct with >95% confidence. Sending these to a frontier model is pure waste. A cascade using Haiku/Flash as the first tier with a confidence threshold handles the easy cases at 10-20x lower cost. Only the 20-30% of ambiguous or edge-case inputs escalate to Sonnet/Pro. The key implementation detail: you need calibrated confidence scores. With models that expose logprobs, convert the top-class logprob to a probability and threshold on that \(e.g., escalate if the probability is <0.95\); a raw logprob is never positive, so it cannot be compared to 0.95 directly. Without logprobs, use a separate small model as a difficulty classifier in a two-pass approach, which is still cheaper than sending everything to the frontier. Watch for: escalated requests pay an extra ~100-200ms of latency for the first-pass call, and the cascade increases system complexity. The economics only justify this above ~10K calls/day, where the dollar savings exceed the engineering maintenance overhead.
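The routing rule and its cost arithmetic can be sketched as below. This is a minimal illustration, not a reference implementation: `classify_with_cascade`, `relative_cost`, and the two stub classifiers are hypothetical names, and the sketch assumes each tier can be wrapped as a function returning a label plus the logprob of that label. Real provider calls would replace the stubs.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumed interface: a classifier takes the input text and returns
# (predicted label, logprob of that label). Hypothetical, not a provider API.
Classifier = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    label: str
    tier: str          # "cheap" or "frontier": which tier produced the answer
    confidence: float  # probability reported by the tier that answered


def classify_with_cascade(
    text: str,
    cheap: Classifier,
    frontier: Classifier,
    threshold: float = 0.95,
) -> CascadeResult:
    """Try the cheap tier first; escalate only when its confidence is low."""
    label, logprob = cheap(text)
    confidence = math.exp(logprob)  # logprob <= 0, so convert before comparing
    if confidence >= threshold:
        return CascadeResult(label, "cheap", confidence)
    label, logprob = frontier(text)
    return CascadeResult(label, "frontier", math.exp(logprob))


def relative_cost(escalation_rate: float, cheap_cost_ratio: float) -> float:
    """Cascade cost as a fraction of sending everything to the frontier model.

    Every request pays for the cheap tier; escalated requests also pay for
    the frontier tier. cheap_cost_ratio is cheap price / frontier price.
    """
    return cheap_cost_ratio + escalation_rate


# Stub tiers for demonstration only; real model calls would go here.
def cheap_stub(text: str) -> Tuple[str, float]:
    if "free money" in text:
        return "spam", math.log(0.99)  # easy case: high confidence
    return "ham", math.log(0.60)       # ambiguous case: low confidence


def frontier_stub(text: str) -> Tuple[str, float]:
    return "ham", math.log(0.97)


if __name__ == "__main__":
    for message in ["free money now", "meeting moved to 3?"]:
        print(message, "->", classify_with_cascade(message, cheap_stub, frontier_stub))
    print("relative cost:", relative_cost(escalation_rate=0.25, cheap_cost_ratio=1 / 15))
```

Under this simple per-request cost model, the figures above (20-30% escalation, cheap tier at 1/10 to 1/20 of the frontier price) give a relative cost between 0.25 and 0.40, i.e. roughly 60-75% savings. The threshold only means what it says if the first tier's probabilities are calibrated, so it is worth checking on a labeled sample before relying on 0.95.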
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:56:18.447055+00:00— report_created — created