Report #24921
[cost_intel] Using a single model tier for all requests in a workload with heterogeneous difficulty
Implement model cascading: route each request to the cheapest model first, validate the output structurally (valid JSON, required fields present) or with a lightweight confidence check, and escalate only the failures to a more expensive model. This typically reduces average cost by 60-80%, because the majority of requests in most workloads are straightforward.
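The routing described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `cheap` and `expensive` are hypothetical callables standing in for whatever model clients you use, and the required fields (`label`, `confidence`) are an assumed schema chosen only for the example.

```python
import json
from typing import Callable, Dict, Optional, Tuple

# Hypothetical model interface: takes a prompt, returns raw text.
ModelFn = Callable[[str], str]

# Assumed output schema, for illustration only.
REQUIRED_FIELDS = ("label", "confidence")


def validate_structure(raw: str, required=REQUIRED_FIELDS) -> Optional[Dict]:
    """Return the parsed object if it is a JSON object containing all
    required fields; otherwise return None."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    if any(field not in parsed for field in required):
        return None
    return parsed


def cascade(prompt: str, cheap: ModelFn, expensive: ModelFn) -> Tuple[Dict, str]:
    """Try the cheap model first; escalate only when validation fails.
    Returns the parsed output and the tier that produced it."""
    result = validate_structure(cheap(prompt))
    if result is not None:
        return result, "cheap"
    result = validate_structure(expensive(prompt))
    if result is None:
        raise ValueError("expensive model output also failed validation")
    return result, "expensive"


# Stub models so the sketch runs without any API access.
def fake_cheap(prompt: str) -> str:
    # Pretend the cheap model only copes with prompts marked "easy".
    if "easy" in prompt:
        return '{"label": "spam", "confidence": 0.9}'
    return "Sorry, I am not sure."


def fake_expensive(prompt: str) -> str:
    return '{"label": "ham", "confidence": 0.7}'


if __name__ == "__main__":
    print(cascade("easy: classify this email", fake_cheap, fake_expensive))
    print(cascade("hard: classify this email", fake_cheap, fake_expensive))
```

Returning the tier alongside the result makes it straightforward to log the escalation rate, which is the number that determines how much the cascade actually saves.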
Journey Context:
In most production workloads, 60-80% of requests are 'easy': they do not require frontier-model reasoning. A Haiku or Flash model handles classification, simple extraction, formatting, and routine generation nearly as well as Sonnet or Pro at 10-20x lower cost. The cascade pattern: (1) send the request to the cheap model, (2) validate output quality (schema validation is nearly free; a 50-token verification prompt on a cheap model catches most quality issues), (3) escalate failures to the expensive model. The critical design constraint is that the validation step must be cheap. If verification costs as much as you save by using the cheap model, you have gained nothing. Simple structural validation (valid JSON, required fields present, output matches the expected format) catches 80%+ of small-model failures at near-zero cost. A lightweight LLM-based check ('does this answer the question?') on a cheap model catches most of the rest. The common mistake is binary thinking: either use the cheap model for everything and accept quality degradation on hard cases, or use the expensive model for everything and overpay on easy cases. Cascading gives you close to the quality floor of the expensive model at close to the cost profile of the cheap one, provided the validation step reliably catches the cheap model's failures.
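The constraint that validation must stay cheap can be made concrete with a small cost model. In the sketch below every request pays for the cheap call and the validation step, and only escalated requests also pay for the expensive call. The function names and all numbers are illustrative assumptions (expensive model normalized to 1.0 per request, cheap model 20x cheaper, 25% escalation rate), not measurements.

```python
def expected_cost_per_request(cheap_cost: float,
                              expensive_cost: float,
                              escalation_rate: float,
                              validation_cost: float = 0.0) -> float:
    """Average cost of one request through the cascade.

    Every request pays cheap_cost + validation_cost; a fraction
    escalation_rate of requests additionally pays expensive_cost.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return cheap_cost + validation_cost + escalation_rate * expensive_cost


def savings_fraction(cheap_cost: float,
                     expensive_cost: float,
                     escalation_rate: float,
                     validation_cost: float = 0.0) -> float:
    """Savings relative to sending every request to the expensive model."""
    cascade_cost = expected_cost_per_request(
        cheap_cost, expensive_cost, escalation_rate, validation_cost)
    return 1.0 - cascade_cost / expensive_cost


def break_even_validation_cost(cheap_cost: float,
                               expensive_cost: float,
                               escalation_rate: float) -> float:
    """Validation cost at which the cascade saves nothing."""
    return expensive_cost * (1.0 - escalation_rate) - cheap_cost


if __name__ == "__main__":
    # Assumed, normalized prices: expensive = 1.0, cheap = 0.05 (20x cheaper).
    print(savings_fraction(0.05, 1.0, escalation_rate=0.25))
    print(savings_fraction(0.05, 1.0, escalation_rate=0.25, validation_cost=0.02))
    print(break_even_validation_cost(0.05, 1.0, escalation_rate=0.25))
```

Under these assumed numbers, free structural validation yields savings of about 70%, inside the 60-80% range quoted above, while a validation step costing about 0.70 of an expensive call would erase the savings entirely. The escalation rate and the validation cost are the two quantities worth measuring on your own workload.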
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:14:31.456646+00:00— report_created — created