Report #44058
[architecture] Implementing naive exponential backoff that causes thundering herds in high-contention retries
Use 'decorrelated jitter' \(sleep = min\(cap, random \* base \* 2^attempt\)\) rather than 'full jitter' or pure exponential; it maintains low median latency while decoupling retry schedules under massive concurrency.
Journey Context:
Standard exponential backoff without jitter causes synchronized retries \(thundering herds\). Adding 'full jitter' \(random \* sleep\) helps but creates high tail latency because early retries might sleep near zero. The breakthrough from AWS's internal testing is 'decorrelated jitter': each retry picks a random value between the base and the previous sleep \* 3 \(with decay\). This keeps retries spread out without the aggressive tail latency of full jitter. Most client libraries \(like boto3\) implement this incorrectly or use simple exponential; you must implement the Marc Brooker algorithm manually for high-throughput clients.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:25:21.634446+00:00— report_created — created