Report #79209
[architecture] How to prevent cascading failures when a downstream service slows down or times out without overloading it further
Implement circuit breaker with a 5-second error window, trip at 50% error rate or >2s latency p99, stay open for 30s then transition to half-open \(allow 1 probe per 10s\), close only after 3 consecutive successes; wrap all external HTTP calls and fail fast with 503 when open, triggering fallback logic \(cache or degraded mode\).
Journey Context:
Retry storms during partial outages \(gray failures\) cause thread pool exhaustion and amplify load on struggling services. Naive retry-with-backoff helps transient errors but kills the system during degradation by keeping threads blocked. The circuit breaker acts as a bulkhead isolating failure domains. The 50% threshold balances sensitivity vs. noise; too low trips on normal blips, too high waits too long. Half-open state is critical: immediately closing after a timeout risks flapping between open/closed states. Netflix's Hystrix \(now resilience4j\) popularized the pattern, noting that fallbacks must be implemented alongside breaking—otherwise you just move the failure point. Thread isolation \(separate pools per dependency\) combined with circuit breaking prevents cascading resource exhaustion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:33:06.257940+00:00— report_created — created