Report #22864
[architecture] Cascading failures when a downstream service is slow or down
Wrap calls to external services in a Circuit Breaker that tracks the failure rate. If errors exceed a threshold (e.g., 50% over 10s), 'open' the circuit and fail fast for a cooldown period (e.g., 30s), returning a fallback or an error immediately instead of calling the service. After the cooldown, move to 'half-open' and let a trial request through to test whether the service has recovered: close the circuit if it succeeds, re-open it if it fails.
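The states described above can be sketched roughly as follows. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `min_calls`, and the injectable `clock` parameter are assumptions made for this example, and the defaults simply mirror the example thresholds (50% over 10s, 30s cooldown).

```python
import threading
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    # All names and defaults here are hypothetical, chosen for illustration.
    def __init__(self, failure_ratio=0.5, window_s=10.0, cooldown_s=30.0,
                 min_calls=5, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls      # avoid tripping on one or two samples
        self.clock = clock              # injectable so tests can fake time
        self.state = "closed"
        self._opened_at = 0.0
        self._probe_in_flight = False
        self._results = deque()         # (timestamp, succeeded) pairs
        self._lock = threading.Lock()

    def _before_call(self):
        with self._lock:
            now = self.clock()
            if self.state == "open":
                if now - self._opened_at < self.cooldown_s:
                    raise CircuitOpenError("circuit open, failing fast")
                # Cooldown elapsed: allow a single trial request.
                self.state = "half-open"
                self._probe_in_flight = False
            if self.state == "half-open":
                if self._probe_in_flight:
                    raise CircuitOpenError("trial request already in flight")
                self._probe_in_flight = True

    def _after_call(self, succeeded):
        with self._lock:
            now = self.clock()
            if self.state == "open":
                return  # a call that started earlier finished late; ignore it
            if self.state == "half-open":
                self._probe_in_flight = False
                if succeeded:
                    self.state = "closed"
                    self._results.clear()
                else:
                    self.state = "open"
                    self._opened_at = now
                return
            # Closed: record the outcome and drop samples outside the window.
            self._results.append((now, succeeded))
            while self._results and now - self._results[0][0] > self.window_s:
                self._results.popleft()
            failures = sum(1 for _, ok in self._results if not ok)
            if (len(self._results) >= self.min_calls
                    and failures / len(self._results) > self.failure_ratio):
                self.state = "open"
                self._opened_at = now

    def call(self, func, *args, fallback=None, **kwargs):
        try:
            self._before_call()
        except CircuitOpenError:
            if fallback is not None:
                return fallback()
            raise
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._after_call(False)
            raise
        self._after_call(True)
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id, fallback=lambda: cached_profile)`, where `fetch_profile` and `cached_profile` are placeholders for the real client call and fallback value. The sketch treats every exception as a failure; a real deployment would probably count only timeouts and server-side errors.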
Journey Context:
Without a breaker, the caller's thread pools fill up with requests waiting on the slow downstream until the caller itself fails (resource exhaustion), and the failure then propagates up the stack. Breakers prevent 'retry storms' and give the downstream time to recover (load shedding). They work hand-in-hand with bulkheads (isolating thread pools per dependency). The implementation must be thread-safe and must support a half-open state so the circuit can heal automatically. Logging state transitions is critical for ops visibility. This is distinct from simple retries: a breaker stops calling the downstream altogether during an outage.
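The bulkhead idea mentioned above can be illustrated with a small sketch. Note the substitution: instead of a dedicated thread pool per dependency, this example caps concurrent calls per dependency with a semaphore, which is a simpler stand-in for the same isolation goal. The names `Bulkhead` and `BulkheadFullError` are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    # Hypothetical helper: one instance per downstream dependency.
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queueing, so a slow dependency
        # cannot tie up more than max_concurrent caller threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no free slots")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency, a stalled service can exhaust only its own slots, and calls to other dependencies keep their capacity. Rejecting instead of blocking is a design choice for this sketch; a short acquire timeout would be another reasonable option.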
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:47:08.953930+00:00— report_created — created