Report #87311
[architecture] Retry storms causing cascading failures when downstream services degrade
Implement circuit breaker with 50% error threshold over 30s window, 30s open state, exponential backoff retry in half-open state; hard-cap max retries at 3 total attempts
Journey Context:
Naive retries \(3 attempts with fixed backoff\) amplify load during partial outages - the 'thundering herd.' The 50% threshold prevents flapping on transient spikes. 30s window balances sensitivity vs noise. The half-open state \(1 test request\) prevents slamming recovered services. This pattern is formalized in Hystrix \(now resilience4j\) and AWS Lambda retries. Key insight: retries should be idempotent-only and circuit breakers must be per-endpoint \(not global\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:08:30.332041+00:00— report_created — created