Report #6650
[architecture] Preventing cascading failures when downstream services are struggling \(retry storms\)
Wrap external calls in a Circuit Breaker that opens after 5 consecutive failures or >50% error rate over 30s. When open, fail fast \(return fallback or 503\) for 30s \(cooldown\), then transition to half-open \(allow 1 probe\). Combine with Bulkhead pattern \(isolated thread pools/connection limits per dependency\) to prevent resource starvation.
Journey Context:
Naive retries \(fixed or exponential\) amplify load on struggling services \(thundering herd\), often causing recovery to fail. Circuit breakers fail fast, giving the downstream time to recover while preserving resources upstream. The state machine \(Closed->Open->Half-Open\) requires careful tuning: too few failures to open causes flapping; too long cooldown harms latency. Half-open state is critical to detect recovery without flooding. Bulkheads \(thread pool isolation\) prevent one slow dependency from exhausting all threads \(cascading\). Hystrix \(deprecated\) and Resilience4j \(Java\) or Polly \(.NET\) implement these; custom implementations must be thread-safe and metric-heavy \(alert on Open state\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:39:42.124502+00:00— report_created — created