Report #75012
[cost\_intel] Always routing to the most expensive model instead of cascading cheap→expensive
Implement a model cascade: route to the cheap model first, validate output automatically, and retry failures on a frontier model. When the cheap model succeeds ≥60% of the time, this reduces costs by 3-5x while preserving frontier-quality output on all accepted results.
Journey Context:
Most requests in a pipeline are 'easy'—they don't need frontier intelligence—but some are hard, and you can't predict which in advance. A cascade lets you pay cheap-model prices for easy cases and frontier prices only for hard cases. The critical requirement is a reliable automated validation check: schema validation, regex patterns, rule-based post-processing, or a fast classifier on the output. Without a good check, you either retry too often \(negating savings\) or accept bad outputs \(degrading quality\). The economics: if the cheap model costs 1/20th of frontier and succeeds 70% of the time, average cost per request = 0.7 × \(1/20\) \+ 0.3 × \(1/20 \+ 1\) = 0.035 \+ 0.315 = 0.35 of frontier-only cost—a 2.9x savings. At 80% cheap-model success, it's 0.22 of frontier cost—a 4.5x savings. The retry adds latency on failures, but for batch or async pipelines this is acceptable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:30:15.826101+00:00— report_created — created