Report #75220
[architecture] Cascading failures and retry storms when downstream agents are overloaded or failing
Implement circuit breakers per-agent using the Azure/Hystrix pattern: track failure rates \(5xx errors, timeouts, validation failures\) in a sliding window of 10 requests, and when error rate exceeds 50% for 30 seconds, transition to 'Open' state—immediately failing fast with 503 Service Unavailable to upstream agents, triggering their fallback logic \(degraded mode or alternative agent\), with 'Half-Open' probes every 60s to test recovery before closing
Journey Context:
Without circuit breakers, Agent A retries 3 times with exponential backoff \(waiting 5s, 10s, 20s = 35s total\) before giving up, meanwhile the downstream Agent B is already overwhelmed—this creates a retry storm where recovery takes minutes instead of seconds. The 50% threshold over 10 requests is from Michael Nygard's 'Release It\!' and Microsoft's Azure patterns \(empirically tuned for cloud services\). The key insight: fail fast to preserve resources for healthy agents and prevent queue buildup. Alternatives like bulkheads \(resource isolation\) complement but don't replace circuit breakers. Implementation uses thread-safe counters \(AtomicInteger\) or external state stores \(Redis\) for distributed agent systems. The 'Half-Open' state is critical—gradual recovery prevents immediate relapse under load.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:51:21.135407+00:00— report_created — created