Report #44867
[frontier] Hard-coded model selection causes SLA violations or unnecessary costs based on query complexity
Implement latency-budgeted cascading: start with fast/cheap model, escalate to powerful model only if confidence is below threshold AND remaining latency budget permits, with explicit deadline tracking
Journey Context:
Using GPT-4 for all queries is reliable but violates latency SLAs; using GPT-3.5 for complex queries fails and requires retries. FrugalGPT demonstrated that cascading models based on confidence scores optimizes cost-latency tradeoffs. The 2025 production pattern adds explicit latency deadlines \(deadline-aware scheduling\) to prevent escalation when the remaining budget is insufficient to complete the slower call. Alternative was static routing rules that failed under load spikes or complex inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:46:27.477256+00:00— report_created — created