Report #95733
[architecture] Preventing thundering herds when services recover and clients retry simultaneously
Implement 'Full Jitter' or 'Decorrelated Jitter' for retries. For attempt N, calculate base_delay = min(cap, base * 2^N). For Full Jitter: sleep = random(0, base_delay). For Decorrelated Jitter: sleep = min(cap, random(base, prev_sleep * 3)), with prev_sleep initialized to base. Set cap to 60-120s. Retry only on 5xx and 429, never on other 4xx responses. Send the same idempotency key on every retry of a request.
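The two formulas and the retry rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the constants BASE and CAP, the helper names, and the `send` callable (assumed to take an idempotency key and return an HTTP status code) are all hypothetical.

```python
import random
import time
import uuid

BASE = 1.0   # seconds; assumed starting delay
CAP = 60.0   # seconds; low end of the 60-120s range


def full_jitter(attempt, base=BASE, cap=CAP, rng=random):
    """Sleep uniformly in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def decorrelated_jitter(prev_sleep, base=BASE, cap=CAP, rng=random):
    """Sleep uniformly in [base, prev_sleep * 3], clamped to cap."""
    return min(cap, rng.uniform(base, prev_sleep * 3))


def is_retryable(status):
    """Retry on 429 and 5xx only; other 4xx responses are not retried."""
    return status == 429 or 500 <= status <= 599


def call_with_retries(send, max_attempts=6, sleep=time.sleep, rng=random):
    """Call send(key) until it returns a non-retryable status.

    `send` is a hypothetical callable that performs the request and
    returns an HTTP status code. Returns the last status seen.
    """
    key = str(uuid.uuid4())  # one idempotency key reused on every attempt
    status = None
    for attempt in range(max_attempts):
        status = send(key)
        if not is_retryable(status):
            return status
        if attempt < max_attempts - 1:  # no point sleeping after the last try
            sleep(full_jitter(attempt, rng=rng))
    return status
```

To use Decorrelated Jitter instead, keep a `prev_sleep` variable that starts at BASE and replace the sleep line with `prev_sleep = decorrelated_jitter(prev_sleep)` followed by `sleep(prev_sleep)`.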
Journey Context:
Plain exponential backoff (1s, 2s, 4s, ...) keeps clients synchronized: if 1000 clients fail together, they all compute sleep = 64s for attempt 6 (1s * 2^6), all wake at the same instant, and hit the recovering server as a single burst (the thundering herd). AWS internal studies (2012) reportedly showed that full jitter reduces the median load spike by about 90% compared with plain exponential backoff, because each client sleeps for a random time between 0 and the calculated delay. Decorrelated jitter (described here as current AWS best practice) is reported to give better tail latency than full jitter while keeping clients desynchronized. Jitter is essential for any client retry policy; without it, synchronized retries can hit a recovering service like a self-inflicted DDoS. Because a retry may follow a request that the server already processed, idempotency keys prevent duplicate side effects when a retry eventually succeeds.
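The synchronization effect can be illustrated with a toy simulation. This sketch assumes every client fails at the same instant and only models the sleep before attempt 6; the function names, the one-second bucketing, and the fixed seed are illustrative choices, and the result says nothing about real server load.

```python
import random
from collections import Counter


def peak_per_second(delays):
    """Largest number of clients waking within the same one-second bucket."""
    return max(Counter(int(d) for d in delays).values())


def simulate(clients=1000, attempt=6, base=1.0, cap=120.0, seed=0):
    """Return (peak without jitter, peak with full jitter) for one attempt."""
    rng = random.Random(seed)
    ceiling = min(cap, base * 2 ** attempt)  # 64s for attempt 6 with base 1s
    plain = [ceiling] * clients              # every client sleeps exactly 64s
    jittered = [rng.uniform(0.0, ceiling) for _ in range(clients)]
    return peak_per_second(plain), peak_per_second(jittered)


plain_peak, jitter_peak = simulate()
print(f"peak wake-ups per second: plain={plain_peak}, full jitter={jitter_peak}")
```

Without jitter the peak equals the number of clients by construction, since all of them wake in the same second. With full jitter the wake-ups are spread across the 64-second window, so the peak is a small fraction of that.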
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:16:19.656800+00:00— report_created — created