Report #62135
[cost\_intel] Using a single model for all requests regardless of input complexity, overpaying for easy inputs
Implement a two-tier routing system: default to a cheap model \(for example Haiku or a mini-tier model\), and escalate to a frontier model when the cheap model signals low confidence or the input matches known hard patterns \(multi-step reasoning, complex code, ambiguous queries\). This typically reduces costs by 40-60% with <2% quality loss. Monitor the escalation rate, i.e. the fraction of requests sent to the frontier model: >30% means the cheap model is a poor fit for the task; <5% means you are already capturing nearly all of the available savings.
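The monitoring rule above can be sketched as a small counter. This is a minimal illustration, not a reference implementation: `EscalationMonitor`, `cascade_cost_ratio`, and the verdict strings are hypothetical names, and the cost formula assumes a cascade in which every request pays for the cheap call and escalated requests also pay for the frontier call.

```python
from dataclasses import dataclass


@dataclass
class EscalationMonitor:
    """Tracks the fraction of requests escalated to the frontier model."""
    high: float = 0.30  # above this, the cheap model is likely a poor fit
    low: float = 0.05   # below this, little further saving is available
    total: int = 0
    escalated: int = 0

    def record(self, was_escalated: bool) -> None:
        self.total += 1
        if was_escalated:
            self.escalated += 1

    @property
    def rate(self) -> float:
        return self.escalated / self.total if self.total else 0.0

    def verdict(self) -> str:
        if self.total == 0:
            return "no_data"
        if self.rate > self.high:
            return "cheap_model_unsuited"
        if self.rate < self.low:
            return "near_max_savings"
        return "ok"


def cascade_cost_ratio(escalation_rate: float, cheap_to_frontier_price: float) -> float:
    """Cascade cost relative to frontier-only cost (assumed model, see lead-in).

    Every request pays the cheap price; escalated requests also pay the
    frontier price, so the ratio is price_ratio + escalation_rate.
    """
    return cheap_to_frontier_price + escalation_rate
```

Under these assumptions, a cheap model priced at one tenth of the frontier model with a 20% escalation rate costs about 0.3 of the frontier-only baseline. Plug in your own prices and observed rate rather than relying on these illustrative figures.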
Journey Context:
Most production workloads consist mostly of easy requests plus a small set of genuinely hard ones. A sentiment pipeline might have 80% clear-cut cases and 20% ambiguous ones; running everything through Sonnet means overpaying for the 80%.

Routing strategies:
\(1\) Confidence-based: escalate when the cheap model's logprobs indicate uncertainty.
\(2\) Rule-based: route to the frontier model if the input contains code, exceeds a length threshold, or matches a regex for multi-question patterns.
\(3\) Cascade: run the cheap model first, check its output for failure signatures \(hedging language, incomplete JSON, 'I cannot' responses\), and retry with the frontier model if any are found.

The FrugalGPT paper reported cost reductions of up to 98% while matching the best single model on its benchmarks, using LLM cascading.

The failure mode: cascading adds a full extra round-trip on escalated requests, so hard inputs pay the cheap model's latency plus the frontier model's latency \(roughly 2x\). For user-facing features with strict latency SLAs, use rule-based routing upfront rather than a cascade. For offline pipelines, a cascade is ideal because the added latency rarely matters.
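The rule-based and cascade strategies can be combined in a few lines. The sketch below is illustrative only: the regexes, the 2000-character threshold, and the function names are hypothetical choices, and `cheap` and `frontier` are placeholder callables standing in for whatever client you use. Confidence-based routing on logprobs is left out because logprob availability differs between APIs.

```python
import json
import re
from typing import Callable

# Hypothetical heuristics; tune these against your own traffic.
CODE_HINT = re.compile(r"\bdef \w+\(|\bclass \w+|#include\b")
MULTI_QUESTION = re.compile(r"\?.*\?", re.DOTALL)  # two or more question marks
FAILURE_HINT = re.compile(
    r"\b(i cannot|i can't|i'm not sure|i am not sure|it depends)\b",
    re.IGNORECASE,
)


def needs_frontier_upfront(prompt: str, max_chars: int = 2000) -> bool:
    """Rule-based check applied before any model call."""
    return (
        len(prompt) > max_chars
        or bool(CODE_HINT.search(prompt))
        or bool(MULTI_QUESTION.search(prompt))
    )


def looks_failed(output: str, expect_json: bool = False) -> bool:
    """Cascade check: hedging or refusal phrases, or unparseable JSON."""
    if FAILURE_HINT.search(output):
        return True
    if expect_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return True
    return False


def route(
    prompt: str,
    cheap: Callable[[str], str],
    frontier: Callable[[str], str],
    *,
    cascade: bool = True,
    expect_json: bool = False,
) -> tuple[str, str]:
    """Return (tier_used, output) for one request."""
    if needs_frontier_upfront(prompt):
        return "frontier", frontier(prompt)
    draft = cheap(prompt)
    # Escalating here costs a second round-trip, so disable the cascade
    # when a strict latency budget applies.
    if cascade and looks_failed(draft, expect_json):
        return "frontier", frontier(prompt)
    return "cheap", draft
```

Passing `cascade=False` keeps only the upfront rules, which matches the latency-sensitive case; leaving it on suits offline pipelines. The returned tier label can feed an escalation-rate counter.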
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:46:52.721825+00:00— report_created — created