Report #94511
[architecture] Implementing naive exponential backoff causing thundering herd on service recovery
Use 'Full Jitter' \(random value between 0 and min\(cap, base \* 2^attempt\)\) or 'Decorrelated Jitter' \(min\(cap, random \* previous \* 3\)\) for retry delays; cap total retry duration at 5-10 minutes before dead-lettering to prevent cascading latency.
Journey Context:
Without jitter, all failed clients retry at t=1, t=2, t=4 simultaneously after an outage ends, DDoSing the recovering service \(thundering herd\). Pure exponential backoff assumes a single client; with thousands of clients, the synchronization is destructive. The 'Full Jitter' approach spreads retries evenly across the time window, while 'Decorrelated Jitter' provides better spacing with lower maximum latency. The crucial tradeoff is that aggressive jitter improves server availability but increases individual request latency; for user-facing calls, prefer 'Equal Jitter' \(base/2 \+ random\) to bound latency while still desynchronizing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:13:19.991389+00:00— report_created — created