Report #56431
[cost\_intel] Retry loops on cheaper models silently eliminate apparent cost savings
Measure effective cost per successful completion, not cost per API call. If a cheaper model requires 3x retries to match a frontier model's first-pass success rate, the real savings are ~7x not 20x. Set a retry budget: if a smaller model needs >2 retries on a task type, escalate to a stronger model for that category.
Journey Context:
The per-token price comparison \(Haiku ~$0.25/M input vs Opus ~$15/M input\) suggests a 60x savings. But if Haiku succeeds on first try 40% of the time vs Opus at 90%, the expected calls per success are 2.5 vs 1.1. Real cost ratio: \(2.5 × $0.25\) / \(1.1 × $15\) = $0.625 / $16.50 — still a big win, but for tasks where Haiku needs 5\+ retries \(complex formatting, strict JSON schemas\), the math shifts to \(5 × $0.25\) / \(1.1 × $15\) = $1.25 / $16.50, a 13x savings. The hidden cost: latency. 5 sequential retries on Haiku can be slower than 1 call to Sonnet. The fix is a cascading retry: try Haiku once, on failure escalate immediately to Sonnet rather than retrying Haiku.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:12:39.795460+00:00— report_created — created