Report #71009
[cost\_intel] Multi-step reasoning chains in small models produce compounding errors
For tasks requiring 3\+ sequential reasoning steps \(multi-hop QA, chained tool calls, complex math\), use frontier models. Small models exhibit multiplicative error compounding: if each step succeeds independently with 90% probability, a 5-step chain yields ~59% final accuracy \(0.90^5\). Frontier models at 97% per step yield ~86% \(0.97^5\). The frontier cost premium, which can reach 10-20x depending on the model pair, is justified when the task is inherently sequential.
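The compounding arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming every step succeeds independently with the same probability (real chains may have correlated or recoverable errors); the function names `chain_accuracy` and `max_steps_above` are hypothetical helpers, not part of any library.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """End-to-end accuracy of a chain, assuming independent, equally reliable steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def max_steps_above(per_step_accuracy: float, target: float) -> int:
    """Longest chain whose end-to-end accuracy stays at or above `target`."""
    if not 0.0 <= per_step_accuracy < 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1)")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    while chain_accuracy(per_step_accuracy, steps + 1) >= target:
        steps += 1
    return steps


if __name__ == "__main__":
    # Illustrative per-step accuracies taken from the report (90% vs 97%).
    for label, p in [("small", 0.90), ("frontier", 0.97)]:
        print(f"{label}: 5-step accuracy = {chain_accuracy(p, 5):.1%}, "
              f"max steps at >=80% = {max_steps_above(p, 0.80)}")
```

Under these assumptions, a 90% per-step model drops below 80% end-to-end accuracy after 2 steps, while a 97% per-step model holds for 7, which is consistent with the 3\+ step threshold suggested above.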
Journey Context:
The per-step accuracy difference between small and frontier models looks small in isolation \(maybe 90% vs 97%\). But for multi-step tasks, errors compound multiplicatively, not additively: if steps fail independently, end-to-end accuracy is the product of the per-step accuracies. This is the key insight most people miss. They benchmark single-step accuracy and assume it carries over to the whole chain. The signature failure mode of small models on multi-step tasks: each intermediate step looks plausible in isolation, but the final answer is wrong. This is especially dangerous because the errors are hard to catch without verifying every step. For agentic workflows with tool use, this compounds further because a wrong intermediate step may trigger the wrong tool call, producing garbage input for the next step. Cost comparison: Sonnet at $3/M input tokens vs Haiku at $0.80/M input tokens is 3.75x more expensive, but often 2-3x more accurate on 5-step chains \(the illustrative 90% vs 97% figures alone give about 1.5x, so a 2-3x gap implies a wider per-step difference in practice\).
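One way to weigh the price gap against the accuracy gap is to ask how costly a wrong final answer must be before the more expensive model pays for itself. The sketch below is a rough model, not a definitive method: it uses the input prices and illustrative per-step accuracies quoted above, ignores output-token pricing (not given in this report), and assumes a hypothetical 10,000 input tokens per 5-step task. `TOKENS_PER_TASK` and `break_even_failure_cost` are made-up names for illustration.

```python
TOKENS_PER_TASK = 10_000  # hypothetical: total input tokens across a 5-step chain


def task_cost(price_per_million_input: float, tokens: int = TOKENS_PER_TASK) -> float:
    """Input-token cost in dollars for one task (output tokens are ignored here)."""
    return price_per_million_input * tokens / 1_000_000


def break_even_failure_cost(cheap_cost: float, cheap_acc: float,
                            strong_cost: float, strong_acc: float) -> float:
    """Failure cost F at which both models have equal expected cost per task.

    Expected cost per task = inference cost + (1 - accuracy) * F.
    Setting the two expected costs equal and solving for F gives the formula below.
    """
    if strong_acc <= cheap_acc:
        raise ValueError("strong model must be more accurate for a break-even to exist")
    return (strong_cost - cheap_cost) / (strong_acc - cheap_acc)


if __name__ == "__main__":
    steps = 5
    cheap_acc = 0.90 ** steps    # illustrative small-model chain accuracy
    strong_acc = 0.97 ** steps   # illustrative frontier-model chain accuracy
    f = break_even_failure_cost(task_cost(0.80), cheap_acc, task_cost(3.00), strong_acc)
    print(f"Frontier model is cheaper in expectation once a failure costs more than ${f:.3f}")
```

With these assumed inputs the break-even failure cost comes out to roughly eight cents per task, so the premium is easy to justify whenever a wrong final answer is expensive to detect or repair. The result scales with the token count and changes if output-token prices are included.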
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:46:13.374319+00:00— report_created — created