Report #30772
[cost\_intel] Using one model tier for all requests in a mixed-difficulty workload
Implement a model cascade: route each request through the cheapest model first, and escalate to a stronger model only when the cheap model's confidence falls below a threshold. This typically reduces costs by 60-80% while maintaining 95%\+ of frontier-model quality on mixed workloads.
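As a rough sanity check on the 60-80% figure, the sketch below estimates blended cost from the escalation rate, using the $0.25/MTok and $3/MTok prices quoted in this report. It assumes every request pays for the cheap model, escalated requests additionally pay for the frontier model, and both calls consume the same number of tokens. It ignores separate input and output pricing and any cost of the confidence check. The function names are hypothetical.

```python
def cascade_cost_per_mtok(escalation_rate: float,
                          cheap_price: float = 0.25,
                          frontier_price: float = 3.00) -> float:
    """Blended cost per million tokens under a two-tier cascade.

    Assumption: every request is first sent to the cheap model, and a
    fraction `escalation_rate` is sent again to the frontier model.
    """
    return cheap_price + escalation_rate * frontier_price


def savings_vs_frontier(escalation_rate: float,
                        cheap_price: float = 0.25,
                        frontier_price: float = 3.00) -> float:
    """Fractional saving relative to sending everything to the frontier model."""
    blended = cascade_cost_per_mtok(escalation_rate, cheap_price, frontier_price)
    return 1.0 - blended / frontier_price


for rate in (0.10, 0.20, 0.30):
    cost = cascade_cost_per_mtok(rate)
    saving = savings_vs_frontier(rate)
    print(f"escalate {rate:.0%}: ${cost:.2f}/MTok, saving {saving:.1%}")
# escalate 10%: $0.55/MTok, saving 81.7%
# escalate 20%: $0.85/MTok, saving 71.7%
# escalate 30%: $1.15/MTok, saving 61.7%
```

Under these assumptions, savings of 60-80% correspond to escalating roughly 10-30% of requests, so the escalation rate is the main number to monitor.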
Journey Context:
Most real-world workloads are heavily skewed toward easy requests: roughly 70-80% are easy \(FAQ answers, simple lookups, basic formatting\), 15-20% are medium, and 5-10% are genuinely hard. Running everything through a frontier model means paying $3/MTok for requests that a $0.25/MTok model handles just as well. The cascade pattern: \(1\) run a small model \(Haiku/mini class\), \(2\) check confidence via logprobs or explicit self-assessment, \(3\) escalate low-confidence cases to the stronger model. The implementation cost is modest, essentially a routing layer, and the savings are immediate. The trap is over-escalating, which erodes those savings: start with a conservative confidence threshold that protects quality, then relax it as you measure quality on real traffic. Cascading also adds latency to escalated requests \(two model calls instead of one\), so it suits throughput-optimized pipelines better than latency-optimized ones.
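The three steps above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: `Reply`, `confidence`, `cascade`, and the stub models are hypothetical names, the models are passed in as plain callables rather than real API clients, and confidence is taken to be the geometric-mean token probability computed from logprobs, which is only one of several possible signals.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Reply:
    text: str
    token_logprobs: List[float]  # natural-log probability of each generated token


@dataclass
class CascadeResult:
    text: str
    tier: str          # "cheap" or "strong"
    confidence: float  # score of the cheap model's reply (the routing signal)


Model = Callable[[str], Reply]  # hypothetical adapter around a real model client


def confidence(reply: Reply) -> float:
    """Geometric-mean token probability in (0, 1]; 0.0 when logprobs are missing."""
    if not reply.token_logprobs:
        return 0.0  # no evidence, so treat as low confidence and escalate
    return math.exp(sum(reply.token_logprobs) / len(reply.token_logprobs))


def cascade(prompt: str, cheap: Model, strong: Model,
            threshold: float = 0.9) -> CascadeResult:
    first = cheap(prompt)                  # step 1: cheapest model first
    score = confidence(first)              # step 2: check confidence
    if score >= threshold:
        return CascadeResult(first.text, "cheap", score)
    second = strong(prompt)                # step 3: escalate low-confidence cases
    return CascadeResult(second.text, "strong", score)


# Stub models standing in for real API calls (assumption for the demo).
def stub_cheap(prompt: str) -> Reply:
    if "refund policy" in prompt:
        return Reply("Refunds are accepted within 30 days.", [-0.01, -0.02, -0.01])
    return Reply("Not sure.", [-1.2, -0.9, -1.5])


def stub_strong(prompt: str) -> Reply:
    return Reply("Detailed answer.", [-0.05])


easy = cascade("What is the refund policy?", stub_cheap, stub_strong)
hard = cascade("Reconcile these two contracts.", stub_cheap, stub_strong)
print(easy.tier)  # cheap
print(hard.tier)  # strong
```

The default threshold of 0.9 is an arbitrary conservative starting point. Logging `tier` and `confidence` for every request gives the data needed to relax it later.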
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:02:07.637797+00:00— report_created — created