Report #76842
[architecture] Cascading failures from retry storms during downstream outages
Implement a circuit breaker with three states: Closed (normal), Open (fail fast after 5 consecutive errors), and Half-Open (allow a test request after a 30s timeout). Wrap only external network calls, not internal logic, and use per-dependency circuit breakers, not global singletons.
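A minimal Python sketch of the three-state breaker described above. The class and method names (CircuitBreaker, CircuitOpenError, call) are illustrative and not taken from any particular library. The sketch assumes single-threaded use, so it has no locking, and it treats any exception raised by the wrapped call as a failure.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # fail fast, dependency is not contacted
    HALF_OPEN = "half_open"  # a test request is allowed through


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    # Hypothetical helper: defaults mirror the report (5 errors, 30s timeout).
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let this call through as the test request.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed test request, or reaching the threshold, (re)opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success resets the consecutive-failure count and closes the circuit.
        self.failures = 0
        self.state = State.CLOSED
```

One instance would be created per external dependency and used only around the network call itself, for example breaker.call(fetch_profile, user_id), where fetch_profile is a hypothetical function that performs the remote request.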
Journey Context:
Exponential backoff without circuit breakers still amplifies load during outages: clients keep hammering the dying service with delayed retries, which prevents recovery. Circuit breakers convert fail-slow (timeouts) into fail-fast (immediate errors), giving downstream services time to recover and preserving threads and memory in the caller. Common mistakes: sharing one breaker across all external services (couples unrelated failures), implementing the open state without half-open (the breaker never recovers automatically), or tripping on application-level 4xx errors instead of failures that indicate an unhealthy dependency (5xx responses, timeouts, connection errors). Alternatives like bulkheads isolate thread pools but don't stop retry storms; using both together provides defense in depth. The approach works because it bounds the blast radius and prevents healthy services from dying under retry pressure during partial outages.
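Two of the mistakes above, counting 4xx responses and sharing one breaker, can be illustrated with a small Python sketch. The function names and the dependency labels are hypothetical. Treating a missing status code as a connection-level failure is an assumption of this sketch.

```python
from collections import defaultdict


def counts_as_failure(status_code=None, timed_out=False):
    """Return True only for outcomes that suggest the dependency is unhealthy."""
    if timed_out or status_code is None:
        return True  # timeout, or no response received at all (assumed connection error)
    # 4xx means the request itself was rejected; the dependency is still healthy.
    return 500 <= status_code <= 599


# One independent consecutive-failure counter per dependency name,
# so an outage in one service never trips the breaker for another.
consecutive_failures = defaultdict(int)


def record_result(dependency, status_code=None, timed_out=False):
    """Update and return the consecutive-failure count for one dependency."""
    if counts_as_failure(status_code, timed_out):
        consecutive_failures[dependency] += 1
    else:
        consecutive_failures[dependency] = 0
    return consecutive_failures[dependency]
```

With this classification, a burst of 404 responses from one service leaves its count at zero, while repeated 503 responses or timeouts from another service accumulate only under that service's own key.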
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:34:16.726691+00:00— report_created — created