Report #55643
[architecture] How to prevent cascading failures during thundering herd retries
Implement full jitter (a random delay between 0 and min(cap, base * 2^attempt)) or decorrelated jitter (a random delay between base and previous * 3, capped at a maximum delay), and limit retries to 3-5 attempts. Combine this with circuit breakers that open once the error rate reaches 50%, so clients stop sending requests to unhealthy services.
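The two jitter strategies can be sketched as follows. This is a minimal illustration, not a drop-in library: the function names, the `cap` of 30 seconds, the default `base` of 1 second, and the `retry` helper are all assumptions made for the example. The decorrelated variant follows the formulation in the AWS Architecture Blog post on backoff and jitter (uniform between `base` and three times the previous delay, then capped).

```python
import random
import time


def full_jitter(attempt, base=1.0, cap=30.0, rng=random):
    """Delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def decorrelated_jitter(previous, base=1.0, cap=30.0, rng=random):
    """Delay drawn uniformly from [base, previous * 3], then capped.

    Pass the delay returned by the previous call (or `base` on the
    first retry) as `previous`.
    """
    upper = max(base, previous * 3)
    return min(cap, rng.uniform(base, upper))


def retry(call, max_attempts=4, base=1.0, cap=30.0, sleep=time.sleep):
    """Hypothetical helper: run `call`, retrying with full jitter.

    Re-raises the last exception once `max_attempts` is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(full_jitter(attempt, base, cap))
```

In real code, catch only the exceptions that are actually retryable (timeouts, throttling responses) rather than `Exception`, and inject `sleep` in tests so they do not wait on the wall clock.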
Journey Context:
Simple exponential backoff (1s, 2s, 4s) keeps clients synchronized: clients that failed at the same moment all retry at the same moments, so each retry wave hits the recovering service at once and can knock it over again. The AWS Architecture Blog's analysis of exponential backoff and jitter found that adding jitter substantially reduces the number of competing calls compared with unjittered backoff, and that full jitter and decorrelated jitter perform comparably. Decorrelated jitter derives each delay from the previous delay rather than from the attempt count; whether it beats full jitter in a high-concurrency scenario depends on the workload, so measure both before choosing. Without circuit breakers, clients waste resources retrying dead services, and without jitter, retries amplify outages rather than relieving them.
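A circuit breaker that opens at a 50% error rate might look like the sketch below. The class name, the sliding window of 20 calls, the minimum of 10 samples, and the 30 second cooldown are illustrative assumptions, not values from this report. The half-open handling is simplified: after the cooldown, calls are allowed until the first result is recorded.

```python
import time
from collections import deque


class CircuitBreaker:
    """Hypothetical sliding-window breaker; opens at `error_threshold`."""

    def __init__(self, window=20, min_calls=10, error_threshold=0.5,
                 cooldown=30.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True marks a failed call
        self.min_calls = min_calls
        self.error_threshold = error_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, success):
        """Report the outcome of a request that was allowed."""
        if self.opened_at is not None:
            if success:
                # Probe succeeded: close and start a fresh window.
                self.opened_at = None
                self.results.clear()
            else:
                # Probe failed: stay open and restart the cooldown.
                self.opened_at = self.clock()
            return
        self.results.append(not success)
        if len(self.results) >= self.min_calls:
            error_rate = sum(self.results) / len(self.results)
            if error_rate >= self.error_threshold:
                self.opened_at = self.clock()
```

A caller would check `allow()` before each request, skip the request (and the retry) when it returns False, and pass every outcome to `record()`. The `min_calls` guard keeps a single early failure from opening the breaker.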
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:53:28.203337+00:00— report_created — created