Report #30870
[architecture] Preventing cascade failures when downstream services fail
Implement a circuit breaker that opens after 5 errors in 60s, stays open for 30s, then moves to half-open to test recovery. While the breaker is open, fail fast immediately; never let the downstream timeout cascade to callers.
Journey Context:
Timeouts alone are insufficient for resilience: while a call waits for its timeout (e.g., 30s), the calling thread is blocked, and under sustained failure the thread pool is exhausted, causing the caller to fail even if the downstream service recovers. A circuit breaker fails fast (returns an error immediately) once errors exceed the configured threshold, preserving threads for calls to healthy services. The half-open state is critical: after the sleep window, it lets a single request through to test whether the service has recovered, without hitting it with full load. Without half-open, the breaker would close fully after each sleep window and send full traffic to a service that may still be failing, so the system flaps between open and closed.
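The state machine described above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `window_s`, `open_s`, and `clock` are hypothetical and do not come from the report. It assumes every exception raised by the wrapped call counts as an error, and it uses the thresholds from the report (5 errors in 60s, 30s open) as defaults.

```python
import threading
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream service."""


class CircuitBreaker:
    """Hypothetical sketch: closed -> open -> half_open -> closed (or back to open)."""

    def __init__(self, failure_threshold=5, window_s=60.0, open_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # errors needed to open
        self.window_s = window_s                    # sliding error window
        self.open_s = open_s                        # sleep window while open
        self.clock = clock                          # injectable for tests
        self.state = "closed"
        self._failures = []                         # timestamps of recent errors
        self._opened_at = 0.0
        self._probe_in_flight = False
        self._lock = threading.Lock()

    def call(self, fn):
        with self._lock:
            self._before_call()   # may raise CircuitOpenError (fast fail)
        try:
            result = fn()         # runs outside the lock so callers do not serialize
        except Exception:
            with self._lock:
                self._on_failure()
            raise
        with self._lock:
            self._on_success()
        return result

    def _before_call(self):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.open_s:
                raise CircuitOpenError("circuit open: failing fast")
            # Sleep window elapsed: allow exactly one probe request.
            self.state = "half_open"
            self._probe_in_flight = False
        if self.state == "half_open":
            if self._probe_in_flight:
                raise CircuitOpenError("circuit half-open: probe in flight")
            self._probe_in_flight = True

    def _on_failure(self):
        now = self.clock()
        if self.state == "half_open":
            self._trip(now)       # probe failed: reopen for another sleep window
            return
        # Keep only errors inside the sliding window, then record this one.
        self._failures = [t for t in self._failures if now - t < self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _on_success(self):
        if self.state == "half_open":
            # Probe succeeded: resume normal traffic.
            self.state = "closed"
            self._failures.clear()
            self._probe_in_flight = False

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
        self._probe_in_flight = False
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` where `fetch_profile` stands in for any downstream call, and treat `CircuitOpenError` as an immediate failure that can be answered with a fallback instead of waiting on a timeout. The sketch ignores edge cases such as a slow call that started while the breaker was closed and finishes during half-open.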
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:11:58.224373+00:00— report_created — created