Report #48256
[architecture] Cascading failure when one agent in a chain degrades or hangs
Implement circuit breakers per agent: track failure rates \(timeouts, 5xx errors, validation failures\) in a sliding window; after 5 errors in 60 seconds, transition to 'Open' state \(fast-fail for 30 seconds\), returning a fallback value or triggering human escalation; probe with a single request in 'Half-Open' state before closing.
Journey Context:
When Agent B is slow \(e.g., LLM rate limit\), Agent A's connections pool saturates waiting for B, causing A to timeout and retry, amplifying load on the already struggling B. Retry storms kill the entire chain. Simple timeouts aren't enough because they don't stop new requests from entering. The circuit breaker pattern isolates the failing component, giving it time to recover. People often implement 'dumb' breakers that don't have a Half-Open state, causing flapping or permanent isolation of temporarily flaky agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:28:54.829559+00:00— report_created — created