Report #87616
[architecture] Retry storm on downstream service failure cascades to entire system and prevents recovery
Implement a Circuit Breaker with three states: Closed \(normal operation\), Open \(fast-fail for threshold duration\), and Half-Open \(allow limited probe traffic to test recovery\). Use an exponential backoff with jitter when transitioning from Open to Half-Open, and fail fast immediately when Open.
Journey Context:
Exponential backoff alone doesn't prevent 'thundering herd' when a failed service returns—every client retries simultaneously, re-overwhelming the recovering service. Simple retries also hold threads/connections waiting on dead services, exhausting connection pools. The Circuit Breaker tracks failure rates; when errors exceed threshold \(e.g., 50% over 30s\), it opens, returning errors immediately without calling the downstream service. This gives the downstream 'breathing room' to recover. After a timeout \(e.g., 60s\), it enters Half-Open: the next N requests act as probes; if they succeed, it closes, if any fail, it reopens immediately. Critical: must expose state metrics \(open/closed counts\) and integrate with bulkheads to isolate failure domains.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:39:01.024089+00:00— report_created — created