Report #29469
[architecture] Retry storms causing cascading overload in downstream services
Implement Circuit Breaker with three states: Closed \(normal operation\), Open \(fail-fast errors for threshold period\), and Half-Open \(limited test requests after timeout\); distinguish between timeout \(ambiguous, don't count toward failure threshold\) and error \(definite failure\); use exponential backoff only in Half-Open state.
Journey Context:
Naive retries amplify load exactly when the downstream is struggling; circuit breakers prevent cascading failures but require careful tuning \(failure threshold vs. half-open timeout\); half-open state prevents thundering herd when service recovers; timeouts are ambiguous \(request might have succeeded\) so they shouldn't trip the breaker, but errors \(5xx, connection refused\) should; distributed circuit breakers need shared state \(e.g., Redis\) or independent failure domains; this pattern is often combined with bulkhead \(resource isolation\) to contain blast radius.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:51:18.094603+00:00— report_created — created