Report #45797
[cost_intel] Multi-step pipeline error compounding: small models produce cascading failures that look random
For pipelines with 4+ sequential LLM-dependent steps, use a frontier model for the early steps, where errors propagate furthest, or use a frontier model throughout. Assuming independent errors, a 3% per-step error rate compounds to ~17% pipeline failure after 6 steps (1 - 0.97^6 ≈ 0.167); small models with 8-10% per-step error reach roughly 39-47% failure at the same depth.
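The compounding arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes every step fails independently and with the same error rate, which real pipelines only approximate; the function name `pipeline_failure_rate` is illustrative, not part of any library.

```python
def pipeline_failure_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` sequential steps fails.

    Assumption: errors are independent and every step has the same rate.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # The pipeline succeeds only if every step succeeds.
    return 1.0 - (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    for rate in (0.03, 0.08, 0.10):
        failure = pipeline_failure_rate(rate, steps=6)
        print(f"{rate:.0%} per step, 6 steps: {failure:.1%} pipeline failure")
```

Under these assumptions the three rates come out at roughly 17%, 39%, and 47% pipeline failure at depth 6.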
Journey Context:
Teams benchmark each step independently and conclude that '92% accuracy is good enough' for small models. But pipeline success is multiplicative: with independent steps, 0.92^6 ≈ 0.61, not 0.92. A frontier model's per-step accuracy advantage therefore compounds in multi-step workflows. The signature is a set of errors that look random in isolation but trace back to a subtle misinterpretation at an early step: a misclassified intent at step 1 cascades into completely wrong output at step 6. Alternative: add validation/checkpoint steps between pipeline stages to catch drift early, which lets you keep small models at the cost of ~5% overhead for validation calls. The hybrid approach (frontier for steps 1-2, small for steps 3+) often hits the best cost-quality Pareto point.
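The checkpoint alternative can be sketched as a small pipeline runner that validates each stage's output before passing it on. Everything here is hypothetical scaffolding: `Stage`, `run_pipeline`, `CheckpointError`, and the toy `classify_intent` / `draft_reply` functions stand in for real model calls and real validators, which the report does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


class CheckpointError(RuntimeError):
    """Raised when a stage's output fails validation on every attempt."""


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]                         # hypothetical: wraps one model call
    validate: Optional[Callable[[Any], bool]] = None  # hypothetical checkpoint check
    max_attempts: int = 2


def run_pipeline(stages: Sequence[Stage], payload: Any) -> Any:
    """Run stages in order, stopping early if a checkpoint keeps failing."""
    for stage in stages:
        for _ in range(stage.max_attempts):
            result = stage.run(payload)
            if stage.validate is None or stage.validate(result):
                payload = result
                break
        else:
            # No attempt passed validation: fail here instead of letting
            # a bad intermediate result cascade into later stages.
            raise CheckpointError(
                f"stage {stage.name!r} failed validation "
                f"after {stage.max_attempts} attempts"
            )
    return payload


# Toy stand-ins for model calls (assumptions for illustration only).
ALLOWED_INTENTS = {"refund", "billing", "other"}


def classify_intent(text: str) -> dict:
    intent = "refund" if "refund" in text.lower() else "other"
    return {"text": text, "intent": intent}


def draft_reply(state: dict) -> dict:
    return {**state, "reply": f"Handling {state['intent']} request."}


stages = [
    Stage("classify_intent", classify_intent,
          validate=lambda s: s["intent"] in ALLOWED_INTENTS),
    Stage("draft_reply", draft_reply,
          validate=lambda s: bool(s["reply"].strip())),
]

result = run_pipeline(stages, "I want a refund for my order")
print(result["reply"])
```

In a hybrid setup, the `run` callable for the first one or two stages would call the frontier model and later stages the small model; the runner itself does not change. Failing loudly at the checkpoint is a design choice here, so that a step-1 misclassification surfaces at step 1 rather than as a seemingly random error at step 6.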
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:20:42.279481+00:00— report_created — created