Report #11614
[architecture] Cascading failures from retrying failed requests to struggling downstream services
Implement the Circuit Breaker pattern: track the downstream failure rate; once it crosses a threshold (e.g., 50% errors in 60s), 'open' the circuit and fail fast for a cooldown period; after the cooldown, go 'half-open' to probe for recovery before closing again. Combine with exponential backoff on the retries that are still allowed.
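The exponential backoff half of the fix can be sketched as below. This is a minimal illustration, not a reference implementation: `retry_with_backoff`, its parameters (`max_retries`, `base_s`, `cap_s`), the `is_retryable` hook, and the injectable `sleep` are all hypothetical names chosen for the example.

```python
import time


def retry_with_backoff(fn, max_retries=4, base_s=0.1, cap_s=10.0,
                       is_retryable=lambda exc: True, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays.

    Delay before retry n (zero-based) is min(cap_s, base_s * 2**n).
    All parameter names here are assumptions made for this sketch.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt, or on errors that retrying
            # cannot fix (for example a 4xx-style business error).
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(min(cap_s, base_s * 2 ** attempt))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Backoff only spaces retries out; it does not stop them, which is why it is paired with the breaker rather than used alone.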
Journey Context:
Naive retries on every error amplify load on degraded services (retry storms) and can turn a partial degradation into a total outage. Exponential backoff helps but still adds load. A circuit breaker sits in front of the downstream call like a proxy, with three states: closed (normal, calls pass through), open (fail fast without calling downstream), and half-open (allow a single probe call). This gives downstream time to recover and preserves upstream resources such as threads and connections. Critical: distinguish transient errors (network timeouts) from business logic errors (400 Bad Request), and count only transport/server errors (5xx, timeouts) toward the threshold. The breaker must be thread-safe and track outcomes in a sliding window (count-based or time-based). The half-open state prevents closing the circuit prematurely: the circuit closes only if the probe succeeds, and a failed probe reopens it. Combine with the bulkhead pattern (isolated thread pools) for fuller resilience. Alternative: rate limiting, but the two solve different problems: a circuit breaker protects availability, rate limiting protects capacity.
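The state machine described above could look roughly like the following. It is a sketch under stated assumptions: the class and parameter names (`CircuitBreaker`, `failure_ratio`, `window_s`, `min_calls`, `open_s`, `is_failure`) are invented for illustration, the `min_calls` floor is an added safeguard not mentioned in the report, and the sliding window is time-based.

```python
import threading
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting downstream."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=60.0, min_calls=10,
                 open_s=30.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio  # e.g. 0.5 = 50% errors
        self.window_s = window_s            # sliding window length
        self.min_calls = min_calls          # assumed floor before tripping
        self.open_s = open_s                # fail-fast (cooldown) period
        self._clock = clock
        self._lock = threading.Lock()       # guards all mutable state
        self._events = deque()              # (timestamp, failed) pairs
        self._state = "closed"
        self._opened_at = 0.0
        self._probing = False

    @property
    def state(self):
        with self._lock:
            return self._state

    def _before_call(self):
        """Admit or reject the call; return True if it is the probe."""
        with self._lock:
            now = self._clock()
            if self._state == "open":
                if now - self._opened_at < self.open_s:
                    raise CircuitOpenError("circuit open, failing fast")
                self._state, self._probing = "half_open", False
            if self._state == "half_open":
                if self._probing:
                    raise CircuitOpenError("probe already in flight")
                self._probing = True
                return True
            return False

    def _after_call(self, failed, is_probe):
        with self._lock:
            now = self._clock()
            if is_probe:
                self._probing = False
                if failed:
                    self._state, self._opened_at = "open", now
                else:
                    self._state = "closed"
                    self._events.clear()
                return
            if self._state != "closed":
                return  # late result from a call admitted before tripping
            self._events.append((now, failed))
            while self._events and now - self._events[0][0] > self.window_s:
                self._events.popleft()
            total = len(self._events)
            failures = sum(1 for _, f in self._events if f)
            if total >= self.min_calls and failures / total >= self.failure_ratio:
                self._state, self._opened_at = "open", now

    def call(self, fn, is_failure=lambda exc: True):
        """Run fn() through the breaker. is_failure(exc) decides whether an
        exception counts toward the threshold (transport/server errors)
        or not (business errors such as a 400)."""
        is_probe = self._before_call()
        try:
            result = fn()
        except Exception as exc:
            self._after_call(is_failure(exc), is_probe)
            raise
        self._after_call(False, is_probe)
        return result
```

A caller would wrap each downstream request as `breaker.call(lambda: do_request(), is_failure=...)`, where `do_request` stands in for the real client call, and treat `CircuitOpenError` as a signal to fall back or fail fast. Exceptions that `is_failure` rejects are still re-raised to the caller; they are simply recorded as non-failures for the breaker's statistics.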
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:46:59.017572+00:00— report_created — created