Report #75402
[architecture] Implementing exponential backoff without jitter causes thundering herd retries
Add full jitter (a random value between 0 and the current delay) or decorrelated jitter to backoff calculations; for AWS SDK scenarios, prefer 'equal jitter' (half fixed, half random) to balance latency against dispersion.
Journey Context:
Without jitter, clients that hit a timeout at the same moment compute identical backoff delays, so they retry in near-perfect synchronization once the outage clears. The resulting thundering herd overwhelms the recovering service and extends the outage. Full jitter (sleep = random(0, min(cap, base * 2^attempt))) maximizes desynchronization but makes individual delays highly variable, anywhere from zero to the full backoff. Equal jitter (sleep = min(cap, base * 2^attempt)/2 + random(0, min(cap, base * 2^attempt)/2)) provides a middle ground: every client waits at least half the backoff, and only the other half is randomized. Decorrelated jitter (sleep = min(cap, random(base, prev_sleep * 3))), as described by AWS, derives each delay from the previous sleep rather than from the attempt count, which reduces the correlation between successive attempts and makes it a strong candidate for high-contention scenarios.
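The three formulas above can be sketched in Python as follows. This is a minimal illustration, not a vetted implementation. The function names, the `rng` and `sleep` parameters, the default values, and the `retry` helper are all hypothetical and do not come from any SDK. `base` and `cap` are assumed to be in seconds.

```python
import random
import time


def capped_exponential(base, cap, attempt):
    """Un-jittered backoff: min(cap, base * 2^attempt)."""
    return min(cap, base * 2 ** attempt)


def full_jitter(base, cap, attempt, rng=random):
    """Uniform delay between 0 and the capped exponential backoff."""
    return rng.uniform(0, capped_exponential(base, cap, attempt))


def equal_jitter(base, cap, attempt, rng=random):
    """Half of the backoff is fixed, the other half is randomized."""
    half = capped_exponential(base, cap, attempt) / 2
    return half + rng.uniform(0, half)


def decorrelated_jitter(base, cap, prev_sleep, rng=random):
    """Delay derived from the previous sleep, not the attempt count.

    Assumes the caller seeds prev_sleep with base on the first retry.
    """
    return min(cap, rng.uniform(base, prev_sleep * 3))


def retry(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Hypothetical helper: retry `operation` with full-jitter backoff.

    Catches every Exception for brevity; real code should only retry
    errors known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted, surface the last error
            sleep(full_jitter(base, cap, attempt))
```

To switch strategies, replace the `full_jitter` call in `retry` with `equal_jitter`. For `decorrelated_jitter`, the loop also has to carry the previous sleep value between iterations, starting from `base`.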
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:09:34.318439+00:00— report_created — created