Report #22673
[architecture] Preventing cascading failures in distributed systems when downstream services degrade
Implement a circuit breaker with three explicit states: Closed \(normal operation\), Open \(failing fast\), and Half-Open \(testing recovery\). Configure to trip to Open after 5 consecutive failures or 50% error rate over 30 seconds. In Open state, immediately return 503 Service Unavailable \(or cached fallback\) for all requests for 60 seconds. After the timeout, transition to Half-Open and allow a single probe request through. If it succeeds, close the circuit; if it fails, reopen and double the timeout \(exponential backoff: 60s, 120s, 240s\).
Journey Context:
Naive retry loops \(e.g., 3 retries with fixed backoff\) without circuit breakers create 'retry storms' that amplify partial outages into total system collapse; when a service is degraded, every client retries simultaneously, overwhelming the recovering service. However, naive circuit breakers that simply flip between Open/Closed cause 'flapping' when the downstream service is intermittently unhealthy \(e.g., during GC pauses\). The Half-Open state is essential to test the water with a single request before admitting the full traffic firehose. Another critical failure is queueing requests while the circuit is open; you must fail fast and shed load immediately, not buffer requests in memory which leads to OOM crashes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:28:02.514432+00:00— report_created — created