Report #38197
[cost_intel] Multi-model cascade routing confidence thresholds for cost reduction
Implement a confidence-based router that uses GPT-4o-mini (or Haiku) to produce calibrated confidence scores: route to frontier models (GPT-4o/Claude-3.5-Sonnet) only when confidence is below 50%, use a mid-tier model for 50-90%, and use cheap models above 90%. This reduces costs by 65% versus using Sonnet everywhere, but it requires 500+ labeled calibration examples and monitoring for overconfidence on out-of-distribution inputs.
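A minimal sketch of the three-band routing rule described above, assuming the confidence score has already been calibrated and scaled to the range 0 to 1. The names `Thresholds` and `pick_tier` and the tier labels are hypothetical, and the handling of the exact 50% and 90% boundaries is an assumption, since the report does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Band edges from the report; tune both on your own calibration set."""
    low: float = 0.50   # below this, escalate to a frontier model
    high: float = 0.90  # above this, the cheap model's answer is accepted


def pick_tier(confidence: float, t: Thresholds = Thresholds()) -> str:
    """Map a calibrated confidence in [0, 1] to a model tier label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence < t.low:
        return "frontier"
    # Assumption: both boundary values (0.50 and 0.90) fall in the mid band.
    if confidence <= t.high:
        return "mid"
    return "cheap"
```

Keeping the band edges in one small value object makes it easy to swap in thresholds recomputed from a fresh calibration run without touching the routing logic.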
Journey Context:
The naive approach is to use the strongest model for every query. The smarter approach (FrugalGPT) is a cascade: try the cheap model first and escalate only if its confidence is low. However, raw model confidence scores (logprobs) are poorly calibrated; cheap models are often confidently wrong. The hard-won insight is that you must calibrate the confidence thresholds on your specific task distribution, using 500+ labeled examples, to find the confidence level above which the cheap model reaches 90% accuracy. Without calibration, the router sends hard questions to cheap models (overconfidence), which destroys accuracy, and easy questions to expensive models (underconfidence), which erodes the savings. The 65% cost savings only materialize with proper calibration and monitoring for distribution shift.
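One way to pick the upper threshold from labeled data, sketched under assumptions: each calibration example is a `(confidence, is_correct)` pair for the cheap model, and the goal is the lowest confidence cut at which every example at or above the cut is, taken together, at least 90% accurate. The function name `calibrate_threshold` and the `min_support` guard are hypothetical and not part of the report.

```python
from typing import Optional, Sequence, Tuple


def calibrate_threshold(
    examples: Sequence[Tuple[float, bool]],
    target_accuracy: float = 0.90,
    min_support: int = 50,
) -> Optional[float]:
    """Return the lowest confidence cut whose at-or-above bucket meets the target.

    Returns None when no cut with at least `min_support` examples qualifies.
    """
    ranked = sorted(examples, key=lambda e: e[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Evaluate only once a whole group of tied confidences is included.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if i >= min_support and correct / i >= target_accuracy:
            best = conf  # later (lower) qualifying cuts overwrite earlier ones
    return best
```

A `None` result is itself a useful signal: it suggests the cheap model never reaches the target accuracy on this task at any confidence level with enough support. Rerunning the same function on a recent labeled sample and comparing the result against the deployed threshold is one simple check for distribution shift.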
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:35:12.337444+00:00— report_created — created