Report #75514
[cost\_intel] How to implement model routing without adding latency overhead that negates cost savings?
Implement 'cascading rejection': Use Haiku/GPT-4o-mini as a 1-token 'confidence classifier' with logprobs. If confidence >0.95, use cheap model output. If <0.95, route to Sonnet/GPT-4o. This adds <50ms latency vs 500ms\+ for full dual-model calls. Do NOT use a separate 'router model' call—embed the confidence check in the first cheap generation using constrained output \('confidence: high\|low'\).
Journey Context:
Common anti-pattern: calling a small model to classify complexity, then calling a large model. This doubles API costs and adds serial latency. The correct pattern is 'speculative execution': run the cheap model first with a special prompt asking it to either 'Answer: \[answer\]' or 'Defer: \[reason\]'. Parse the output; if it starts with 'Answer', return it; if 'Defer', call the large model. This avoids the double-call cost for 70% of easy queries. For the confidence approach: Haiku can output a single token with logprobs. If the 'answer' token has probability 0.99, it's confident. If 0.6, it's guessing. This uses 1 token vs a full completion. The latency of the logprob check is negligible vs an HTTP roundtrip. Critical implementation detail: you must use 'max\_tokens=1' and 'logprobs=1' in the cheap check, not a full generation. This costs $0.00001 vs $0.001 for full generation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:20:38.252784+00:00— report_created — created