Report #92454
[architecture] Implementing exponential backoff with jitter to prevent thundering herds
Use exponential backoff with 'full jitter' (a random delay between 0 and min(cap, base * 2^attempt)) for transient errors. Cap the maximum delay (e.g., 60s). Stop retrying when the error is non-retryable, when the operation is not idempotent, or after 3-5 attempts, then escalate by opening a circuit breaker.
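A minimal Python sketch of the full-jitter delay described above. The function name `full_jitter_delay` and the defaults (`base=1.0`, `cap=60.0`) are illustrative assumptions, not part of the report:

```python
import random


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a delay in seconds drawn uniformly from [0, min(cap, base * 2**attempt)].

    `attempt` is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    # Clamp the exponent so very large attempt numbers cannot overflow a float;
    # the cap dominates long before this limit matters (assumed safeguard).
    exponent = min(attempt, 30)
    ceiling = min(cap, base * (2 ** exponent))
    return random.uniform(0.0, ceiling)
```

With these defaults the upper bound grows 1s, 2s, 4s, ... and stops growing at 60s from the seventh retry onward (2^6 = 64 > 60).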
Journey Context:
Simple retry loops (3 retries, fixed 1s delay) fail under load: when a service hiccups, all clients retry at the same moment, creating a 'thundering herd' that overwhelms the recovering service and can crash it again. Exponential backoff (delay = min(cap, base * 2^attempt)) spreads retries over time, but clients that failed together still synchronize on the same retry windows (all wait 1s, then 2s, etc.). Adding 'full jitter' (a random value between 0 and the calculated delay) decorrelates client retries and smooths the load. 'Equal jitter' (half the calculated delay fixed, plus a random value up to the other half) spreads retries over a narrower window, so it is less effective under high contention. Always cap the delay (e.g., 60s): uncapped, the delay doubles on every attempt and with a 1s base passes an hour by attempt 12 (2^12 = 4096s). Critical: only retry idempotent operations, classify errors (429/503 = retryable; 400/401 = don't retry), and limit total attempts to prevent infinite loops on persistent failures. The final defense is a circuit breaker: after N failures, stop calling the service entirely for a cooldown period.
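The pieces above (error classification, bounded attempts, full jitter, circuit breaker) might fit together roughly as follows. This is a sketch under stated assumptions: `send` is a hypothetical callable that returns an HTTP status code, and `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all default thresholds are illustrative names and values, not an existing library API.

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}  # per the classification above; adjust per service


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Counts consecutive failures; opens for `cooldown` seconds after a threshold."""

    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open behaviour).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(send, breaker, idempotent=True, max_attempts=4,
                      base=1.0, cap=60.0, sleep=time.sleep):
    """Call send() with full-jitter backoff; return the final status code."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; skipping call")
        status = send()
        if status < 400:
            breaker.record_success()
            return status
        if status not in RETRYABLE_STATUSES:
            return status  # e.g. 400/401: retrying will not help
        breaker.record_failure()
        if not idempotent or attempt == max_attempts - 1:
            return status  # unsafe to repeat, or attempts exhausted
        # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
        sleep(random.uniform(0.0, min(cap, base * 2 ** attempt)))
```

Passing `sleep` and `clock` as parameters is a design choice for testability; in production the defaults (`time.sleep`, `time.monotonic`) apply. Only retryable failures count toward the breaker here, on the assumption that client errors such as 400/401 say nothing about service health.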
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:46:27.986291+00:00— report_created — created