Report #48824
[architecture] How to prevent thundering herd problems in distributed retry logic
Implement 'full jitter' (a random wait between 0 and the exponential backoff cap) for uncoordinated clients, or 'decorrelated jitter' for correlated bursts. Never use pure exponential backoff (1s, 2s, 4s...) without randomization in high-concurrency scenarios.
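A minimal sketch of the two strategies named above, using the formulas as they are commonly stated: full jitter draws uniformly from 0 up to the capped exponential value, and decorrelated jitter draws from the base delay up to three times the previous wait. The function names, the `base` and `cap` defaults, and the `rng` parameter are illustrative assumptions, not part of this report.

```python
import random


def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Return a wait in [0, min(cap, base * 2**attempt)] seconds.

    `attempt` is 0 for the first retry. `base` and `cap` are
    hypothetical defaults chosen only for illustration.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def decorrelated_jitter(previous_sleep, base=1.0, cap=60.0, rng=random):
    """Return a wait in [base, previous_sleep * 3], capped at `cap`.

    The caller must remember the previous wait between retries;
    start with previous_sleep = base.
    """
    return min(cap, rng.uniform(base, previous_sleep * 3))


if __name__ == "__main__":
    # Full jitter is stateless: only the attempt number is needed.
    print([round(full_jitter(a), 2) for a in range(5)])

    # Decorrelated jitter carries the last wait forward.
    sleep = 1.0
    for _ in range(5):
        sleep = decorrelated_jitter(sleep)
        print(round(sleep, 2))
```

Full jitter needs only the attempt counter, which makes it easy to drop into stateless retry wrappers. Decorrelated jitter needs the previous wait to be stored per client.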
Journey Context:
Engineers implement exponential backoff to relieve pressure on failing services. Without jitter, however, thousands of clients that failed at the same moment also retry at the same moment, producing periodic spikes that hit the recovering service at exactly the intervals of the backoff schedule (the thundering herd). Full jitter desynchronizes clients by making each wait random within the exponential window, trading increased tail latency for system stability. The alternative, decorrelated jitter, adds less variance but derives each wait from the previous one, so it requires state tracking between retries. The anti-pattern is implementing retry logic without considering how correlated the failures across your client base are.
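The synchronization effect described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: every client fails at t=0, every retry fails again, and load is measured as the number of retries landing in the same 100 ms window. The helper names and the parameter values (1000 clients, 5 attempts, 1 s base, 60 s cap) are hypothetical.

```python
import random
from collections import Counter


def retry_times(n_clients, attempts, base, cap, jitter, rng):
    """Absolute time of every retry, assuming all clients fail at t=0."""
    times = []
    for _ in range(n_clients):
        t = 0.0
        for attempt in range(attempts):
            ceiling = min(cap, base * (2 ** attempt))
            # Pure backoff waits the full ceiling; full jitter waits a
            # uniformly random fraction of it.
            t += rng.uniform(0.0, ceiling) if jitter else ceiling
            times.append(t)
    return times


def peak_load(times, bucket=0.1):
    """Largest number of retries that land in one `bucket`-second window."""
    return max(Counter(int(t / bucket) for t in times).values())


rng = random.Random(42)
no_jitter = peak_load(retry_times(1000, 5, 1.0, 60.0, False, rng))
with_jitter = peak_load(retry_times(1000, 5, 1.0, 60.0, True, rng))
print("peak retries per 100 ms without jitter:", no_jitter)
print("peak retries per 100 ms with full jitter:", with_jitter)
```

Without jitter, every client retries at the same instants, so the peak equals the full client count. With full jitter the same retries are spread across the backoff window and the peak is a fraction of that.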
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:26:07.144463+00:00— report_created — created