Report #97680
[architecture] What retry and backoff strategy should I use for transient failures in a distributed system?
Use exponential backoff with jitter, capped at a maximum delay (e.g., 30s), and a maximum retry count (e.g., 5). Implement as: delay = min(cap, base * 2^attempt * random(0.5, 1.5)). Never retry on 4xx errors except 429 (rate limit) and 409 (conflict).
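The formula above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `backoff_delay`, the `rng` parameter, and the defaults (100ms base, 30s cap) are assumptions chosen to match the example values in the answer.

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random.random):
    """Return the delay in seconds before retry number `attempt` (0-based).

    Implements delay = min(cap, base * 2^attempt * random(0.5, 1.5)).
    `rng` is a hypothetical hook returning a float in [0, 1); it exists
    only so the jitter can be made deterministic in tests.
    """
    jitter = 0.5 + rng()  # uniform multiplier in [0.5, 1.5)
    return min(cap, base * (2 ** attempt) * jitter)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 3))
```

One consequence of applying the cap after the jitter: once base * 2^attempt * 0.5 reaches the cap, every client waits exactly the cap and the jitter no longer spreads retries. With the example values this only happens after the suggested 5 retries, but it matters if the retry count is raised.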
Journey Context:
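The naive approach is no backoff or linear backoff, which causes thundering herds when many clients fail at once. Exponential backoff without jitter still causes synchronized retry waves, because every client computes the same delays; jitter fixes this by spreading retries across time. Key tradeoffs: the base delay should exceed typical request latency (e.g., 100ms), the cap prevents unbounded waits, and the maximum retry count prevents resource waste. Retrying is safe for idempotent operations; for non-idempotent ones, use idempotency keys. This follows the pattern described in the AWS Architecture Blog post 'Exponential Backoff and Jitter'; the AWS SDKs retry with jittered exponential backoff by default, and Google Cloud's documentation recommends a similar truncated exponential backoff. Common mistake: retrying on permanent errors (400, 403, 404), which wastes resources.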
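A retry loop that applies these rules might look like the sketch below. Everything named here is hypothetical: `call_with_retries`, `is_retryable`, and the convention that the operation returns a `(status, body)` tuple are assumptions made for illustration, not part of any particular SDK. The sketch treats 5xx responses plus 429 and 409 as transient, following the answer above, and returns every other 4xx immediately.

```python
import random
import time

# 4xx codes treated as transient, per the answer above (an assumption;
# adjust for your API's semantics).
RETRYABLE_4XX = {409, 429}


def is_retryable(status):
    """True for server errors (5xx) and the two retryable 4xx codes."""
    return status >= 500 or status in RETRYABLE_4XX


def call_with_retries(op, max_retries=5, base=0.1, cap=30.0, sleep=time.sleep):
    """Call op() -> (status, body), retrying transient failures.

    Makes at most max_retries + 1 calls. `sleep` is injectable so the
    loop can be tested without actually waiting.
    """
    for attempt in range(max_retries + 1):
        status, body = op()
        # Success or a permanent error: return without retrying.
        if status < 400 or not is_retryable(status):
            return status, body
        if attempt < max_retries:
            delay = min(cap, base * 2 ** attempt * random.uniform(0.5, 1.5))
            sleep(delay)
    # Retries exhausted: surface the last transient failure to the caller.
    return status, body
```

For non-idempotent operations, `op` would also need to send the same idempotency key on every attempt; that part is left out of the sketch.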
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T15:50:56.165448+00:00— report_created — created