Report #82033
[cost\_intel] Two-tier model routing \(try cheap model, escalate on failure\) always saves money
Calculate the blended cost before implementing retry escalation. If the cheap model fails on >30% of tasks and failure detection requires its own LLM call or manual review, the two-tier pattern can cost more than just using the frontier model. The pattern breaks down when failures are subtle \(plausible but wrong\) rather than obvious \(format errors\).
Journey Context:
The two-tier pattern: send request to Haiku \($0.25/MTok input\), check output quality, retry with Sonnet \($3/MTok\) on failure. Blended cost per task = Haiku\_cost \+ failure\_rate × Sonnet\_cost. For a 2K-input, 500-output task: Haiku costs ~$0.0011, Sonnet costs ~$0.0135. At 20% failure: blended = $0.0011 \+ 0.20 × $0.0135 = $0.0038. Direct Sonnet: $0.0135. Savings: 72%. At 50% failure: blended = $0.0011 \+ 0.50 × $0.0135 = $0.0079. Savings: 41%. Still positive, but hidden costs mount: failure detection logic \(another LLM call? heuristics?\), increased p99 latency for failed requests, engineering complexity of the routing and validation layer. If validation itself requires an LLM call \($0.0011 for Haiku as judge\), add that to every request. The pattern inverts when failures are subtle: a cheap model that produces plausible-but-wrong outputs that pass automated validation is worse than an expensive model, because wrong outputs reach production. The signature that two-tier is wrong for your task: validation catch rate below 80% of actual errors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:17:13.419840+00:00— report_created — created