Report #30964
[cost_intel] Using a single model tier for all requests regardless of complexity
Implement a two-tier cascade: route every request to a small model (Haiku/Flash) first, then escalate to a frontier model (Sonnet/GPT-4) only when the small model's confidence falls below a threshold or its output fails validation. For mixed-complexity workloads this can cut cost by 60-80% with under 2% quality loss.
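The routing logic above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `cascade`, `CascadeResult`, and the three callables are hypothetical names, and the model calls are passed in as plain functions so that no particular provider SDK is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CascadeResult:
    text: str
    tier: str        # "small" or "frontier"
    escalated: bool


def cascade(
    prompt: str,
    small_model: Callable[[str], str],     # hypothetical wrapper around the cheap tier
    frontier_model: Callable[[str], str],  # hypothetical wrapper around the expensive tier
    is_valid: Callable[[str], bool],       # structural check on the small model's output
) -> CascadeResult:
    """Try the small model first; escalate only if its output fails validation."""
    draft: Optional[str]
    try:
        draft = small_model(prompt)
    except Exception:
        # Assumption: any small-model failure is treated as a reason to escalate.
        draft = None

    if draft is not None and is_valid(draft):
        return CascadeResult(text=draft, tier="small", escalated=False)

    # Escalation path: the frontier model gets the original prompt unchanged,
    # so no request context is lost when falling back.
    return CascadeResult(text=frontier_model(prompt), tier="frontier", escalated=True)
```

In use, `small_model` and `frontier_model` would wrap whatever client calls the deployment already makes, and `is_valid` would be a structural check such as schema validation rather than a question put back to the model.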
Journey Context:
Model cascading (also called model routing or fallback routing) is one of the highest-ROI architecture patterns for production AI systems with mixed task complexity. The key insight is that most workloads are heavily skewed toward easy requests: 70-80% are routine (extraction, simple Q&A, formatting) and only 20-30% require real reasoning. By sending routine requests to a small model and escalating only the hard ones, you pay small-model prices for the majority of requests while keeping frontier-quality output where it is needed.

The implementation requires: (1) a confidence signal (logprobs, output validation, or a lightweight classifier), (2) an escalation path that preserves context, and (3) monitoring to ensure the escalation rate stays in the 20-30% range. A common mistake is to rely on the small model's self-assessment of confidence ('are you sure?'), which is poorly calibrated. Instead, use structural signals: Did the output parse correctly? Does it match the expected schema? Is the response length in the expected range?
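As a rough sketch of the structural signals and the escalation-rate monitoring described above, the following assumes the small model is asked to return a JSON object. The names `structurally_valid`, `EscalationMonitor`, and `relative_cost` are hypothetical, as are the length bounds and window size. The cost helper assumes an escalated request pays for both tiers.

```python
import json
from collections import deque


def structurally_valid(output: str, required_keys: set,
                       min_len: int = 2, max_len: int = 4000) -> bool:
    """Structural confidence signal: length in range, parses as JSON, has expected keys."""
    if not (min_len <= len(output) <= max_len):
        return False
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


class EscalationMonitor:
    """Tracks the escalation rate over a sliding window of recent requests."""

    def __init__(self, window: int = 1000, low: float = 0.20, high: float = 0.30):
        self.events = deque(maxlen=window)  # True for each escalated request
        self.low, self.high = low, high

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def in_band(self) -> bool:
        # The 20-30% band comes from the report; tune it per workload.
        return self.low <= self.rate() <= self.high


def relative_cost(escalation_rate: float, price_ratio: float) -> float:
    """Cascade cost as a fraction of always using the frontier model.

    Assumes every request pays the small model (price_ratio = small/frontier
    per-request price) and escalated requests additionally pay the frontier model.
    """
    return price_ratio + escalation_rate
```

Under these assumptions, a 25% escalation rate with a small model priced at one tenth of the frontier model gives a relative cost of 0.35, a saving of about 65%. A rate drifting above the band erodes the saving, while a rate well below it may indicate that validation is too lenient.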
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:21:46.039432+00:00— report_created — created