Report #59055
[cost_intel] Small models producing compounding errors on multi-step agentic chains
For agentic workflows with 3+ sequential LLM calls where each step depends on the prior one, use frontier models for the whole chain, or at minimum for the critical decision nodes. A small model at 95% per-step accuracy drops to 77% end-to-end success over 5 steps and 60% over 10 steps, versus 90% and 82% for a frontier model at 98%.
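The figures above can be reproduced with a minimal sketch, assuming every step succeeds independently with the same probability. The function name `chain_success` is hypothetical and used only for illustration.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success probability of a sequential chain.

    Assumption: step errors are independent and each step has the
    same accuracy, so the chain succeeds with probability p ** n.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Per-step accuracies taken from the report: 95% (small), 98% (frontier).
for steps in (5, 10):
    small = chain_success(0.95, steps)
    frontier = chain_success(0.98, steps)
    print(f"{steps:>2} steps: small {small:.0%}, frontier {frontier:.0%}")
```

Real chains may do better or worse than this model: a later step can sometimes recover from an earlier mistake, and errors are often correlated rather than independent.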
Journey Context:
The per-step accuracy gap between frontier and small models looks modest in isolation (e.g., 98% vs 95%) but compounds multiplicatively across a chain: if step errors are independent, end-to-end success is the per-step accuracy raised to the number of steps. This is the core reason small models work fine for single-shot tasks but fail far more often in agentic loops. The cost-quality tradeoff: a 10-step chain on Haiku at $0.25/M tokens costs roughly $0.002 per run but fails about 40% of the time, and each failure means a retry that adds cost and latency. The same chain on Sonnet at $3/M tokens costs roughly $0.02 per run with about 18% failure. Token spend alone still favors the small model, so the comparison turns on what a failed run costs: once retries, latency, and downstream cleanup are counted, frontier is often cheaper in practice for chains longer than 5 steps. Hybrid approach: use frontier for planning and decision steps, and small models for execution steps such as formatting or lookup.
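The retry economics can be sketched as follows. This is an illustration under stated assumptions, not a measured result: runs are retried until one succeeds, attempts are independent, and every failed run incurs the same fixed cleanup cost regardless of model. The per-run costs and failure rates come from the report. The cleanup cost is not given there, so the sketch solves for the break-even value instead of assuming one. Both function names are hypothetical.

```python
def cost_per_success(run_cost: float, success_rate: float,
                     failure_cost: float = 0.0) -> float:
    """Expected spend per successful run when retrying until success.

    Assumption: attempts are independent, so the expected number of
    attempts is 1 / p and the expected number of failures is 1 / p - 1.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    attempts = 1.0 / success_rate
    failures = attempts - 1.0
    return attempts * run_cost + failures * failure_cost


def break_even_failure_cost(cheap_cost: float, cheap_rate: float,
                            dear_cost: float, dear_rate: float) -> float:
    """Per-failure cost at which both options cost the same per success."""
    # Difference in token spend per success (dear minus cheap).
    spend_gap = dear_cost / dear_rate - cheap_cost / cheap_rate
    # Difference in expected failures per success (cheap minus dear).
    failure_gap = (1 - cheap_rate) / cheap_rate - (1 - dear_rate) / dear_rate
    if failure_gap <= 0:
        raise ValueError("the cheaper option must fail more often")
    return spend_gap / failure_gap


# Figures from the report: $0.002 per run at 60% success (small),
# $0.02 per run at 82% success (frontier), for a 10-step chain.
small_only = cost_per_success(0.002, 0.60)
frontier_only = cost_per_success(0.02, 0.82)
threshold = break_even_failure_cost(0.002, 0.60, 0.02, 0.82)

print(f"token spend per success: small ${small_only:.4f}, "
      f"frontier ${frontier_only:.4f}")
print(f"break-even cost per failed run: ${threshold:.4f}")
```

Under these assumptions the frontier model becomes the cheaper option only when a failed run costs more than the printed threshold in cleanup, latency, or other downstream work. Below that threshold the small model with retries remains cheaper.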
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:36:36.450422+00:00— report_created — created