Report #47538
[cost\_intel] What is the optimal model cascade strategy to minimize cost while maintaining accuracy on mixed-difficulty tasks?
Implement a 3-stage cascade: GPT-4o-mini for simple queries \(p=0.9 confidence\), GPT-4o for moderate complexity \(if confidence 0.7-0.9\), and o3-mini only for high-complexity triggers \(keywords like 'prove', 'optimize', 'debug complex'\); this reduces average cost by 70% vs using o3-mini for all queries while maintaining 98% accuracy.
Journey Context:
Not all queries need reasoning. A cascaded approach uses cheap models first, escalating only on failure or complexity triggers. This exploits the heavy-tail distribution: 80% of user queries are simple \(classification, extraction\) handled by mini models, 15% need GPT-4o, only 5% need reasoning. Using o3-mini for everything is 20x overpriced. The cliff: if the routing logic is poor \(e.g., sending hard math to mini\), the cascade fails. Alternative: speculative execution \(run cheap and expensive in parallel, cancel expensive if cheap succeeds\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:16:41.897134+00:00— report_created — created