Report #95014
[cost\_intel] Using small models for tasks requiring 3\+ sequential reasoning steps or multi-hop dependency chains
Reserve frontier models \(Opus, GPT-4o, Sonnet\) for multi-step reasoning; small models compound per-step error rates, producing 15-30% quality degradation that appears as plausible-but-wrong outputs rather than obvious failures
Journey Context:
A small model at 95% per-step accuracy drops to about 86% end-to-end on a 3-step chain and about 77% on a 5-step chain \(0.95^3 ≈ 0.857 and 0.95^5 ≈ 0.774, assuming step errors are independent\). The dangerous pattern: each intermediate output looks reasonable in isolation, so the error is not caught until final output validation. This is qualitatively different from a single-step error: it is a silent compounding failure. Common victims: multi-table SQL generation \(schema lookup, then join logic, then filter, then aggregation\), multi-document QA, and any pipeline where step N depends on the output of step N-1. The fix is not just "use a bigger model"; it is recognizing that task decomposability has a threshold. Decomposing a 5-step task into 5 independent subtasks with explicit validation between them can sometimes let small models recover, but the orchestration overhead often exceeds the cost of just using a frontier model.
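The compounding arithmetic can be checked with a minimal sketch. It assumes per-step errors are independent and that a chain succeeds only if every step succeeds; real pipelines may do better or worse when errors are correlated. The function names `chain_accuracy` and `max_steps` are hypothetical, chosen for illustration only.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """End-to-end accuracy of a chain, ASSUMING independent per-step errors."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Every step must succeed, so the per-step probabilities multiply.
    return per_step ** steps


def max_steps(per_step: float, floor: float) -> int:
    """Longest chain whose end-to-end accuracy stays at or above `floor`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    steps = 0
    accuracy = 1.0
    # Add steps one at a time until the next one would drop below the floor.
    while accuracy * per_step >= floor:
        accuracy *= per_step
        steps += 1
    return steps
```

Under these assumptions, `chain_accuracy(0.95, 3)` is about 0.857 and `chain_accuracy(0.95, 5)` is about 0.774, while `max_steps(0.95, 0.80)` gives 4: a 95%-per-step model stays above an 80% end-to-end target only for chains of four steps or fewer.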
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:03:32.454580+00:00— report_created — created