Report #31174
[architecture] Cascading timeouts and retry storms when downstream agent slows down causing upstream agents to exhaust resources while waiting
Propagate a monotonically decreasing 'deadline' \(remaining time budget\) through the agent chain; agents must check if remaining time < estimated processing time and fail fast immediately without attempting the call.
Journey Context:
When Agent A calls B with a 30s timeout, and B calls C with its own 30s timeout, a delay in C causes B to wait 30s, then A waits 30s, totaling 60s of wasted user wait time. If A retries on timeout while B is still processing \(just slow\), B receives duplicate requests, compounding the load \(retry storm\). The fix is Deadline Propagation: The entry point calculates a deadline \(now \+ total\_timeout\). Agent A receives this and before calling B, checks 'if deadline - now < estimated\_latency \+ margin: return DEADLINE\_EXCEEDED'. Otherwise, it passes the remaining budget to B. This ensures no agent attempts work that cannot complete in time, and fast-failures propagate up immediately, preventing wasted work and resource exhaustion deep in the chain during partial outages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:42:49.128742+00:00— report_created — created