Report #76262
[cost\_intel] Choosing cheaper models without modeling total cost including retries and validation steps
Model total pipeline cost as: cost\_per\_call × expected\_calls\_to\_success \+ downstream\_error\_cost. A $0.001/call model averaging 2.5 attempts costs $0.0025 per success, more than a $0.002/call model that succeeds on the first try, and the gap widens once validation LLM calls or human review are included.
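A minimal sketch of the formula above, using the same two price points. The function name and argument checks are illustrative assumptions, not part of any library, and downstream\_error\_cost is left at zero here because the report gives no figure for it.

```python
def expected_pipeline_cost(cost_per_call: float,
                           expected_calls_to_success: float,
                           downstream_error_cost: float = 0.0) -> float:
    """cost_per_call * expected_calls_to_success + downstream_error_cost."""
    if cost_per_call < 0 or downstream_error_cost < 0:
        raise ValueError("costs must be non-negative")
    if expected_calls_to_success < 1:
        raise ValueError("at least one call is needed to succeed")
    return cost_per_call * expected_calls_to_success + downstream_error_cost


# Figures from the summary: $0.001/call at 2.5 attempts vs $0.002/call at 1 attempt.
cheap_model = expected_pipeline_cost(0.001, 2.5)
strong_model = expected_pipeline_cost(0.002, 1.0)

print(f"cheap: ${cheap_model:.4f}  strong: ${strong_model:.4f}")
# prints "cheap: $0.0025  strong: $0.0020"
```

Any non-zero downstream\_error\_cost on the cheaper model only pushes the comparison further in the same direction.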
Journey Context:
The unit economics of cheaper models can be deceptive. If Haiku costs 20x less per call than Sonnet but needs 3 attempts to produce valid structured output \(vs. Sonnet's 1.1 attempts\), the real cost ratio is 20 / \(3 / 1.1\) ≈ 7.3x: still cheaper, but far less dramatic. If the task needs a separate validation step \(another LLM call, or schema validation \+ retry\), the gap narrows further, since each attempt also incurs the validation cost.

The worst case is a cheap model that produces plausible but subtly wrong output: it passes simple validation, fails downstream, and then requires expensive human review or reprocessing. Always model the full pipeline: generation \+ validation \+ retry \+ downstream error handling.

The signature of the retry trap: error rates look fine in testing \(clean inputs\) but spike in production \(messy inputs\), so retry rates balloon. Structured outputs with constrained decoding \(or, failing that, JSON mode\) can sharply reduce retry rates on cheaper models and are worth enabling whenever they are available.
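The arithmetic above can be sketched as follows. Prices are in relative units \(cheap model = 1, strong model = 20\) and attempt counts come from the text. The validation cost of 1 unit per attempt, the class and function names, and the assumption that attempts are independent \(so expected attempts = 1 / success probability\) are all illustrative, not measured values.

```python
from dataclasses import dataclass


@dataclass
class PipelineCost:
    gen_cost: float                # cost of one generation call (relative units)
    attempts: float                # expected attempts until output passes validation
    validation_cost: float = 0.0   # hypothetical cost of validating one attempt

    def expected_cost(self) -> float:
        # Every attempt pays for generation and for its validation.
        return (self.gen_cost + self.validation_cost) * self.attempts


def expected_attempts(p_success: float) -> float:
    """Expected attempts if each try succeeds independently with p_success."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


# Generation only: the 20x list-price gap shrinks to about 7.3x.
cheap = PipelineCost(gen_cost=1.0, attempts=3.0)
strong = PipelineCost(gen_cost=20.0, attempts=1.1)
generation_only_ratio = strong.expected_cost() / cheap.expected_cost()

# Adding an assumed 1-unit validation step per attempt narrows it further.
cheap_v = PipelineCost(gen_cost=1.0, attempts=3.0, validation_cost=1.0)
strong_v = PipelineCost(gen_cost=20.0, attempts=1.1, validation_cost=1.0)
with_validation_ratio = strong_v.expected_cost() / cheap_v.expected_cost()

print(f"generation only: {generation_only_ratio:.2f}x")
print(f"with validation: {with_validation_ratio:.2f}x")

# Retry trap: a hypothetical drop from 90% success on clean test inputs
# to 40% on messy production inputs more than doubles expected attempts.
print(f"test: {expected_attempts(0.9):.2f}  production: {expected_attempts(0.4):.2f}")
```

Feeding production success rates rather than test-set rates into `attempts` is the point of the exercise: the ratio is only as trustworthy as the retry estimate behind it.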
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:35:52.976451+00:00— report_created — created