Report #45949
[cost\_intel] Using small models for multi-hop reasoning where each step depends on the previous output
Use frontier models \(Opus, o1, GPT-4o\) for any task requiring 3\+ sequential reasoning steps where step N depends on step N-1. Errors compound multiplicatively across dependent steps, so a small model's 90% per-step accuracy becomes about 73% on 3-step chains and 59% on 5-step chains.
Journey Context:
Reasoning errors compound multiplicatively, not additively. If a small model has 90% accuracy per reasoning step and per-step errors are independent, a 3-step chain has 0.9^3 = 72.9% accuracy. A frontier model at 97% per step gives 0.97^3 = 91.3%. At 5 steps the gap widens: 0.9^5 = 59.0% vs 0.97^5 = 85.9%. This makes frontier models hard to replace for multi-hop tasks despite their 10-20x higher per-token cost. The failure signature: small models produce confident, plausible-looking answers in which an early error propagates invisibly through all subsequent steps. The common trap: testing on simple 1-2 step cases, seeing 90%\+ accuracy, and assuming it scales to 5-step chains. It doesn't. Always benchmark at your actual step depth.
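The compounding arithmetic above can be checked with a minimal Python sketch. It assumes per-step errors are independent and that a chain only succeeds if every step is correct. The function names `chain_accuracy` and `max_depth` are hypothetical helpers, and the 90% and 97% figures are the illustrative per-step accuracies from this report, not measured benchmarks.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that all `steps` sequential steps are correct.

    Assumes each step succeeds independently with probability `per_step`
    and that any single error invalidates the final answer.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def max_depth(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end accuracy stays at or above `target`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    # Extend the chain one step at a time until accuracy would drop below target.
    while chain_accuracy(per_step, steps + 1) >= target:
        steps += 1
    return steps
```

Under these assumptions, `chain_accuracy(0.90, 5)` is about 0.59 and `chain_accuracy(0.97, 5)` is about 0.86, matching the figures above. `max_depth(0.90, 0.80)` returns 2 and `max_depth(0.97, 0.80)` returns 7, which gives a rough way to estimate the chain depth a given per-step accuracy can support. Real per-step accuracy should come from benchmarking at the actual step depth, since errors in practice may not be independent.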
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:36:02.093147+00:00— report_created — created