Report #38493
[architecture] How to prevent cascade failures when a downstream service is slow
Wrap calls in a Circuit Breaker: track the failure rate over a sliding window; open the circuit once that rate crosses a threshold so callers fail fast; use a half-open state to probe for recovery before resuming full traffic.
Journey Context:
Timeouts alone do not prevent resource exhaustion: each call still holds a thread until its timeout expires, so threads keep blocking on a dead service. Retry storms from multiple clients amplify the load on the struggling downstream. The circuit breaker isolates the failure by converting slow failures into fast failures, which prevents thread pool starvation. It must distinguish transient errors (retryable) from permanent errors (which count toward the threshold), and it requires monitoring to detect flapping, where the circuit repeatedly cycles between half-open and open.
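The three states described above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not a production implementation: the class name `CircuitBreaker`, the exception `CircuitOpenError`, and every parameter (`window`, `failure_ratio`, `min_calls`, `reset_after`, `clock`, `counted`) are hypothetical names chosen for this example. Which exception types count toward the threshold is a policy decision, so the sketch leaves it to the caller through `counted`.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    """Illustrative breaker; not thread-safe, all names are assumptions."""

    def __init__(self, counted, window=10, failure_ratio=0.5, min_calls=5,
                 reset_after=30.0, clock=time.monotonic):
        self.counted = counted                # exception types that count as failures
        self.results = deque(maxlen=window)   # sliding window: True means failure
        self.failure_ratio = failure_ratio    # open once failures / calls reaches this
        self.min_calls = min_calls            # do not trip on too small a sample
        self.reset_after = reset_after        # seconds to stay open before probing
        self.clock = clock                    # injectable so tests can control time
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: no thread blocks waiting on the slow service.
                raise CircuitOpenError("circuit open, call rejected")
            self.state = "half_open"          # let this call through as a probe
        try:
            result = fn(*args, **kwargs)
        except self.counted:
            self._on_failure()
            raise
        # Exceptions outside `counted` propagate without touching the window.
        self._on_success()
        return result

    def _on_failure(self):
        self.results.append(True)
        # A failed probe reopens immediately; otherwise check the window.
        if self.state == "half_open" or self._tripped():
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        if self.state == "half_open":
            # Probe succeeded: resume traffic with a fresh window.
            self.state = "closed"
            self.results.clear()
        self.results.append(False)

    def _tripped(self):
        n = len(self.results)
        return n >= self.min_calls and sum(self.results) / n >= self.failure_ratio
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as a signal to serve a fallback instead of retrying. Counting transitions into the open state over time is one simple way to surface the flapping mentioned above.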
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:05:17.121302+00:00— report_created — created