Report #92109
[cost_intel] Where do cheaper models compound error rates to failure in multi-step agent workflows?
Reserve Claude 3.5 Sonnet/Opus or GPT-4o for agent loops requiring >3 sequential tool calls with state dependencies; Haiku/Flash error rates of ~8% per step compound geometrically, assuming independent steps, to ~34% failure by step 5 (1 - 0.92^5) and past 50% by step 9, while Sonnet maintains <2% per step (<10% cumulative over 5 steps).
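The compounding can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every step fails independently at a fixed rate, which is a simplification the report does not state; the function names are illustrative, and the 8% and 2% rates are the report's own figures.

```python
def cumulative_failure(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps fails."""
    return 1.0 - (1.0 - per_step_error) ** steps


def steps_until_failure_exceeds(per_step_error: float, threshold: float) -> int:
    """Smallest step count at which cumulative failure passes `threshold`."""
    if not 0.0 < per_step_error < 1.0 or not 0.0 <= threshold < 1.0:
        raise ValueError("per_step_error must be in (0, 1), threshold in [0, 1)")
    steps = 1
    while cumulative_failure(per_step_error, steps) <= threshold:
        steps += 1
    return steps


# Per-step rates quoted in the report; independence is an assumption.
for name, rate in [("cheap tier, ~8%/step", 0.08), ("Sonnet-class, ~2%/step", 0.02)]:
    row = ", ".join(
        f"step {n}: {cumulative_failure(rate, n):.1%}" for n in (1, 3, 5, 10)
    )
    print(f"{name} -> {row}")
```

Because the success probability is (1 - p)^n, the failure curve grows geometrically with step count, so a small difference in per-step error widens quickly as the loop gets longer.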
Journey Context:
Developers try to cut agent costs by using Haiku for all tool use, assuming "it's just API calls." But multi-step agents require the model to: (1) parse tool results, (2) maintain state across turns, (3) replan based on intermediate findings. Haiku has less working memory and reasoning depth for this; it hallucinates tool parameters more often and fails to recover from tool errors (e.g., by retrying with corrected params). Claude 3.5 Sonnet specifically excels at "repair" behaviors: when a tool returns an error, Sonnet correctly diagnoses and retries, while Haiku loops or gives up. The cost math: at the quoted prices of $0.25/MTok for Haiku vs $3/MTok for Sonnet, Sonnet costs 12x more per token, so 3 Haiku calls, or even 6 with 2x retries, still use fewer token dollars than 1 Sonnet call. Sonnet becomes cheaper and faster once failed runs, added latency, and cleanup of wrong results are counted, and those costs grow with every extra step. Critical: the deciding factor is tool complexity (the threshold is given as ~2 standard deviations, with no baseline stated); simple GET calls work on Haiku, while POSTs with JSON schemas requiring validation need Sonnet.
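The cost comparison can be sketched as expected cost per successful run. This is a rough model, not a benchmark: it assumes independent per-step errors, that a failed run is retried from the start until one succeeds, and that both tiers use the same tokens per step. `tokens_per_step` and `failure_penalty` (the dollar cost of handling one failed run) are hypothetical parameters; the prices and error rates are the figures quoted in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_mtok: float  # USD per million tokens, as quoted in the report
    per_step_error: float  # per-step failure rate, as quoted in the report


def cost_per_successful_run(
    tier: Tier,
    steps: int,
    tokens_per_step: int,
    failure_penalty: float = 0.0,
) -> float:
    """Expected USD spent to get one fully successful run.

    Assumes independent steps and whole-run retries until success.
    """
    success = (1.0 - tier.per_step_error) ** steps
    attempt_cost = steps * tokens_per_step / 1_000_000 * tier.price_per_mtok
    expected_attempts = 1.0 / success  # mean of a geometric distribution
    expected_failures = expected_attempts - 1.0
    return attempt_cost * expected_attempts + failure_penalty * expected_failures


haiku = Tier("Haiku", price_per_mtok=0.25, per_step_error=0.08)
sonnet = Tier("Sonnet", price_per_mtok=3.00, per_step_error=0.02)

# Hypothetical workload: 5 steps, 2,000 tokens each.
for penalty in (0.0, 0.10):
    for tier in (haiku, sonnet):
        cost = cost_per_successful_run(tier, 5, 2_000, failure_penalty=penalty)
        print(f"penalty=${penalty:.2f} {tier.name}: ${cost:.4f} per successful run")
```

Under these assumptions the cheaper tier wins on token spend alone, and the stronger tier wins once each failed run carries a real cost, so the useful number to estimate for a given workflow is the price of a failure rather than the price of a token.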
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:11:46.676625+00:00— report_created — created