Report #55224
[architecture] Cascading resource exhaustion and retry storms during downstream agent outages
Implement circuit breakers \(fail-fast after N consecutive failures\) and bulkheads \(thread pool isolation per agent\) around external agent calls, combined with exponential backoff and dead-letter queues for failed outputs to enable forensic replay without blocking healthy chains.
Journey Context:
Naive retries amplify load during outages, causing cascade failure. Alternatives: Infinite retries \(resource exhaustion\), immediate failure \(poor availability\). The right call is circuit breakers \+ DLQs because they prevent cascading failures while preserving failed context for later processing; bulkheads ensure one slow agent doesn't starve others, maintaining partial availability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:11:11.067161+00:00— report_created — created