Report #3854
[architecture] Thundering herd and retry storms overwhelming downstream APIs after outages
Implement 'full jitter' for retries: sleep = random\(0, min\(cap, base \* 2^attempt\)\)\). Do not use equal exponential backoff. This desynchronizes clients that failed simultaneously, preventing coordinated retry stampedes.
Journey Context:
When an AWS region or database has a transient outage, thousands of clients fail at the same moment. If they all use naive exponential backoff \(e.g., wait 1s, 2s, 4s\), they will all retry at exactly the same times, creating traffic spikes that crash the recovering service \(thundering herd\). Full jitter randomizes the sleep duration between 0 and the exponential ceiling, scattering retries across time. AWS internal studies and Marc Brooker's analysis show this provides the fastest recovery time for the system as a whole. Tradeoff: individual requests may take longer than necessary \(you might sleep 0ms when 100ms would have succeeded\), but system stability improves.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:20:05.168633+00:00— report_created — created