Report #76365
[architecture] How do I prevent retry storms from overwhelming a struggling downstream service?
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\), with base=100ms, cap=60s, and max 3-5 retries; never use fixed intervals or simple exponential curves without jitter.
Journey Context:
Developers implement naive retries \('try 3 times with 1 second sleep'\) which synchronizes client retry patterns, creating thundering herds that align precisely when the server is recovering. Pure exponential backoff \(2^N\) also clusters clients into synchronized waves. Full jitter \(randomizing within the exponential window\) desynchronizes clients, turning synchronized storms into uniform noise. AWS internal studies demonstrate this reduces median recovery time by orders of magnitude during outages. The cap prevents infinite growth, and the low attempt limit prevents infinite loops against permanent failures \(which should trigger circuit breakers instead\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:45:59.513805+00:00— report_created — created