Report #97631
[architecture] How do I implement retry logic with exponential backoff to handle transient failures without overwhelming the system?
Use capped exponential backoff with full jitter. Algorithm: max_retries = 3–5; base_delay = 50–100 ms; for attempt number i (starting at 1), compute delay = random(0, 1) * min(cap, base_delay * 2^(i-1)), where the random factor is the full jitter. Sleep for that delay before each retry. On the client side, attach a unique request ID (idempotency key) and reuse it on every retry so that retries are safe. On the server side, ensure the operation is idempotent or deduplicated by that key. If you need a guaranteed minimum wait between attempts, consider equal jitter instead: with d = min(cap, base_delay * 2^(i-1)), use delay = d/2 + random(0, d/2). Either variant spreads retries out and helps avoid a thundering herd. Once max retries are exhausted, use the circuit breaker pattern to stop retrying altogether for a cooldown period.
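The algorithm above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: `TransientError`, `backoff_delay`, and `retry` are hypothetical names, the defaults (100 ms base, 10 s cap, 4 retries) are just values picked from the ranges mentioned here, and delays are in seconds.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base_delay=0.1, cap=10.0, jitter="full", rng=random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    # Capped exponential ceiling: base_delay, 2x, 4x, ... up to `cap`.
    ceiling = min(cap, base_delay * 2 ** (attempt - 1))
    if jitter == "full":
        # Full jitter: uniform over [0, ceiling].
        return rng.uniform(0, ceiling)
    if jitter == "equal":
        # Equal jitter: half fixed, half random, so never below ceiling / 2.
        return ceiling / 2 + rng.uniform(0, ceiling / 2)
    raise ValueError(f"unknown jitter mode: {jitter!r}")


def retry(operation, max_retries=4, sleep=time.sleep, **backoff_kwargs):
    """Call `operation` until it succeeds or the retries are exhausted."""
    # One initial try plus up to `max_retries` retries.
    for attempt in range(1, max_retries + 2):
        try:
            return operation()
        except TransientError:
            if attempt > max_retries:
                raise  # out of retries: surface the last failure
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

Only `TransientError` is retried here, on the assumption that the caller maps retryable conditions (timeouts, 503s, and so on) to it; non-transient errors propagate immediately. The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting.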
Journey Context:
Many implementations use linear backoff or deterministic exponential backoff (e.g., 1s, 2s, 4s), which causes a 'thundering herd' when many clients fail together and then all retry at the same instants. Adding full jitter spreads retries uniformly across the delay window, dramatically reducing collision probability. Common mistake: not capping the maximum delay (e.g., at 10–30 seconds), which allows unbounded waits. Another mistake: ignoring the difference between client-side and server-side retries. Server-side retries of idempotent operations must still respect backoff, and client retries must include idempotency keys. AWS's guidance recommends full jitter; Google's recommendation (truncated exponential backoff with jitter) is similar. Always use exponential backoff rather than fixed intervals, because it gives progressively longer pauses and so allows transient issues to subside.
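To make the idempotency-key point concrete, here is a small sketch of server-side deduplication. Everything in it is hypothetical (`PaymentServer`, `charge`, `new_idempotency_key`, the response shape), and the in-memory dict stands in for what would presumably be a durable store with an atomic check-and-set in a real service.

```python
import uuid


class PaymentServer:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = 0    # number of times the side effect actually ran

    def charge(self, idempotency_key, amount_cents):
        # A replayed key returns the stored response instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        response = {"charge_id": self.charges, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


def new_idempotency_key():
    # Generated once per logical request by the client, then reused on every retry.
    return str(uuid.uuid4())


server = PaymentServer()
key = new_idempotency_key()
first = server.charge(key, 500)
again = server.charge(key, 500)  # simulated retry after a lost response
```

In this sketch the retry returns the same response as the first call and the side effect runs once. The key must be created before the first attempt, not regenerated per retry, or the server cannot recognize the duplicate.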
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T15:45:54.869806+00:00— report_created — created