Report #77033
[cost\_intel] Using small models for multi-step reasoning pipelines without accounting for error compounding across steps
For pipelines with 3\+ sequential LLM calls where each step depends on the previous output, use a frontier model \(Sonnet, GPT-4o\) for the whole chain. Assuming step errors are independent, end-to-end accuracy is the per-step accuracy raised to the number of steps: at 95% per step \(small model\) versus 99% \(frontier model\), five sequential steps yield about 77% versus 95%. The 20x per-call savings can become a net loss once you factor in error recovery, retry logic, or manual review costs.
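The arithmetic above can be checked with a short sketch. It assumes step errors are independent and that a failed run incurs one flat downstream cost. The function names, the price of 1 unit per small-model call, and the `failure_cost` values are illustrative assumptions, not measurements. Only the 95%/99% accuracies and the 20x price ratio come from this report.

```python
def end_to_end_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a sequential chain succeeds.

    Assumes step errors are independent, so accuracies multiply.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def expected_cost_per_run(call_cost: float, steps: int,
                          per_step_accuracy: float,
                          failure_cost: float) -> float:
    """Model spend for one run plus the expected cost of a failed run.

    failure_cost is a hypothetical flat cost (retries, review, cleanup),
    expressed in the same units as call_cost.
    """
    p_fail = 1.0 - end_to_end_accuracy(per_step_accuracy, steps)
    return call_cost * steps + p_fail * failure_cost


if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        small = end_to_end_accuracy(0.95, steps)
        frontier = end_to_end_accuracy(0.99, steps)
        print(f"{steps:>2} steps: small={small:.1%}  frontier={frontier:.1%}")

    # Illustrative prices: small = 1 unit per call, frontier = 20 units (20x).
    for failure_cost in (0, 100, 1000):
        small = expected_cost_per_run(1.0, 5, 0.95, failure_cost)
        frontier = expected_cost_per_run(20.0, 5, 0.99, failure_cost)
        print(f"failure_cost={failure_cost:>4}: "
              f"small={small:.1f}  frontier={frontier:.1f}")
```

Under these assumed prices the small model stays cheaper while a failed run is cheap, and the frontier model wins only once a failure costs several hundred times a small-model call. The break-even point depends on your own failure cost, so measure it before switching tiers.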
Journey Context:
The trap: each step looks fine in isolation during testing. The small model scores 93-96% on each step, which seems acceptable. But in a sequential pipeline, errors compound multiplicatively, not additively. The signature of this failure mode is a wrong final output where every intermediate step still looks plausible on its own, which makes debugging extremely difficult. This is where frontier models pay for themselves: not on any single step, but on compound accuracy across a chain. Practical mitigation: use small models for independent parallel steps \(where errors do not compound\) and a frontier model only for the sequential dependency chain. Another pattern: use a small model for all steps but add a frontier-model validator at the end that checks the final output, catching compound errors without paying frontier prices for every step.
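The first mitigation can be sketched as a small routing plan. This is a hypothetical illustration: the `Step` type, the `assign_tiers` function, the tier labels, and the step names are all invented for the example, and no real model API is called. A chain here means a step followed by consecutive steps that each consume the previous output. Chains of three or more go to the frontier tier, following the 3\+ threshold in the recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    depends_on_previous: bool  # True if this step consumes the prior step's output


def assign_tiers(steps: list[Step], min_chain_length: int = 3) -> dict[str, str]:
    """Map each step name to 'frontier' or 'small'.

    Steps are grouped into chains: a head step plus every immediately
    following step that depends on its predecessor. Chains with at least
    min_chain_length steps are routed to the frontier tier.
    """
    tiers: dict[str, str] = {}
    i = 0
    while i < len(steps):
        j = i + 1
        # Extend the chain while the next step depends on the one before it.
        while j < len(steps) and steps[j].depends_on_previous:
            j += 1
        tier = "frontier" if (j - i) >= min_chain_length else "small"
        for step in steps[i:j]:
            tiers[step.name] = tier
        i = j
    return tiers


if __name__ == "__main__":
    pipeline = [
        Step("extract_entities", depends_on_previous=False),
        Step("classify_intent", depends_on_previous=False),
        Step("plan", depends_on_previous=False),
        Step("draft", depends_on_previous=True),
        Step("verify", depends_on_previous=True),
    ]
    for name, tier in assign_tiers(pipeline).items():
        print(f"{name}: {tier}")
```

In this example the two independent steps stay on the small tier, while `plan`, `draft`, and `verify` form a three-step chain and are routed to the frontier tier. The validator pattern would instead keep every step on the small tier and add one frontier call on the final output.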
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:53:30.914254+00:00— report_created — created