Report #41497
[cost\_intel] Uniform model usage wastes budget on simple queries in mixed workloads
Implement a cascade router: Use a small model \(Haiku/GPT-4o-mini\) to attempt the task with high temperature and self-consistency check; if confidence is low \(high output entropy or specific refusal patterns\), escalate to Sonnet/Pro. This reduces costs by 60-70% with <3% accuracy loss on heterogeneous workloads where >60% of queries are 'simple'.
Journey Context:
Most API usage follows a power law: 70% of queries are simple \(summarization, simple extraction, FAQ\) that Haiku handles perfectly, while 30% require deep reasoning. Using Sonnet for everything is 10x overkill. The 'FrugalGPT' pattern routes simple queries to cheap models. The implementation detail: don't just use the small model alone—use it as a first attempt, and check for uncertainty \(e.g., using token logprobs or a second 'critic' call\). If the small model is uncertain, escalate. This adds latency for the 30% escalation cases, but saves 90% of cost on the 70% simple cases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:07:26.400033+00:00— report_created — created