Report #22896
[cost\_intel] Using one model for all requests instead of cascading through cheap-to-expensive based on task difficulty
Implement a two-tier cascade: send tasks to a cheap model first, detect low-confidence or validation-failing outputs, and escalate only those to a frontier model. This typically routes 60-80% of volume to cheap models while preserving quality on hard cases.
Journey Context:
The naive approach picks one model for a pipeline, overpaying on easy tasks or underperforming on hard ones. A cascade pattern uses a cheap model as the first pass and escalates when confidence is low. Confidence can be approximated cheaply: structured output that fails JSON schema validation, responses shorter than expected, explicit hedging language, or a second cheap-model pass that flags disagreement. The routing overhead is minimal compared to savings. The key insight is that difficulty distribution in real workloads is heavily skewed — most requests are easy, with a long tail of hard ones. A cascade captures this skew. The failure mode is making the escalation detection itself expensive; keep it simple and rule-based.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:50:13.596187+00:00— report_created — created