Report #93932
[counterintuitive] Why does my multi-step agent fail at the overall task even though each individual step works when tested in isolation?
Minimize the number of sequential LLM calls in any agent pipeline. Add verification gates between steps using deterministic tools (tests, type checks, schema validation). Design for shallow agent depth (2-3 LLM-dependent steps max) with external tool validation at each step.
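A minimal sketch of a verification gate between steps, assuming each step's output is a string that a deterministic check can accept or reject. The names `Step`, `run_gated_pipeline`, `is_json_with_keys`, `fake_plan`, and `fake_broken_patch` are hypothetical and exist only for illustration; the two `fake_*` functions stand in for real LLM calls.

```python
import json
from dataclasses import dataclass
from typing import Callable


class StepValidationError(Exception):
    """Raised when a step's output fails its deterministic check."""


@dataclass
class Step:
    name: str
    produce: Callable[[str], str]    # normally an LLM call
    validate: Callable[[str], bool]  # deterministic gate: schema, tests, type check


def run_gated_pipeline(steps: list[Step], initial_input: str) -> str:
    """Run steps in order and exit early at the first failed gate."""
    data = initial_input
    for step in steps:
        data = step.produce(data)
        if not step.validate(data):
            # Stop here so bad output never becomes the next step's input.
            raise StepValidationError(f"step {step.name!r} failed validation")
    return data


def is_json_with_keys(*required: str) -> Callable[[str], bool]:
    """Build a validator that accepts only a JSON object with the given keys."""
    def check(text: str) -> bool:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and all(key in parsed for key in required)
    return check


# Hypothetical stand-ins for LLM calls.
def fake_plan(task: str) -> str:
    return json.dumps({"task": task, "plan": ["read file", "apply fix"]})


def fake_broken_patch(plan_json: str) -> str:
    return "Sure! Here is the patch you asked for..."  # not valid JSON
```

With these stand-ins, a pipeline of `fake_plan` followed by `fake_broken_patch` (gated on a `"diff"` key) raises `StepValidationError` at the second step instead of passing malformed text downstream. In a real pipeline the validator could equally be a test run or a type check; the point of the design is that the gate is not itself an LLM call.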
Journey Context:
The common assumption is that if each step in an agent pipeline works 90-95% of the time, the overall pipeline is also 90-95% reliable. This ignores multiplicative error compounding: if every step must succeed and step failures are roughly independent, a 5-step pipeline where each step is 95% accurate succeeds only 0.95^5 ≈ 77% of the time, and a 10-step pipeline drops to about 60%. This is not primarily a model quality issue; it is basic probability, and prompt improvements on individual steps only soften it: even at 99% per step, a 10-step chain succeeds only about 90% of the time. Each LLM call is a separate trial with its own failure probability, and failures cascade: a wrong output from step 2 becomes corrupted input for step 3. The solution is structural: reduce step count, make steps independently verifiable with deterministic tools, and add early-exit conditions when a step's output fails validation. Deep agent chains are inherently fragile; shallow, tool-validated pipelines are far more robust.
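The compounding arithmetic above can be checked with a few lines of Python. This is a sketch under the simplifying assumption that steps fail independently and that the pipeline succeeds only if every step does; `pipeline_success_rate` is a hypothetical helper name.

```python
def pipeline_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        rate = pipeline_success_rate(0.95, steps)
        print(f"{steps:>2} steps at 95% each: {rate:.1%}")
    # Prints:
    #  1 steps at 95% each: 95.0%
    #  3 steps at 95% each: 85.7%
    #  5 steps at 95% each: 77.4%
    # 10 steps at 95% each: 59.9%
```

The 3-step row is why shallow pipelines are recommended: at the same per-step accuracy, cutting depth from 10 to 3 raises the end-to-end rate from about 60% to about 86% before any validation is added.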
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:15:11.032604+00:00— report_created — created