Report #79452
[architecture] What is the correct retry strategy for handling transient network failures in a distributed system?
Implement exponential backoff with full jitter (a random delay between 0 and the calculated backoff) and a capped maximum delay (e.g., 20s). Limit retries to 3-5 attempts, then route the failure to a dead-letter queue or trip a circuit breaker so that retries do not overwhelm a recovering service.
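A minimal Python sketch of the approach described above, assuming synchronous calls. The names `full_jitter_delay`, `call_with_retries`, and `RetriesExhausted`, the 0.5s base delay, and the choice of `ConnectionError`/`TimeoutError` as the transient errors are illustrative assumptions, not part of the report; only the 20s cap and the 3-5 attempt range come from it.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised once every attempt has failed; the caller can dead-letter the work."""


def full_jitter_delay(attempt, base=0.5, cap=20.0, rng=random.random):
    """Delay in seconds for a 0-indexed attempt.

    The exponential backoff base * 2**attempt is capped at `cap`, then a
    uniform random point between 0 and that ceiling is chosen (full jitter).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=4, base=0.5, cap=20.0,
                      retriable=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call `operation` up to `max_attempts` times, sleeping between failures."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except retriable as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            sleep(full_jitter_delay(attempt, base, cap))
    # Hypothetical hand-off point: a caller would catch this and publish the
    # request to a dead-letter queue or record a circuit-breaker failure.
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

Errors outside `retriable` propagate immediately, so a non-transient failure is never retried. The `sleep` and `rng` parameters exist only so the timing can be replaced in tests.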
Journey Context:
Naive immediate retries amplify the 'thundering herd' problem: when a downstream service recovers, it receives a synchronized wave of requests from every retrying client at once and can crash again. Pure exponential backoff (delay proportional to 2^attempt) without jitter still synchronizes: clients that began failing at the same moment (e.g., after a network partition) retry in lockstep. Full jitter (a random delay between 0 and the backoff) or decorrelated jitter (sleep = min(max, random(min, sleep*3)), where min is the base delay and max is the cap) breaks that synchronization. The retry budget should be small (3-5 attempts) because transient failures in a healthy system resolve quickly; a persistent failure points to a deeper issue that needs human intervention, and blind retrying amounts to a denial-of-service attack on your own infrastructure. Essential complements: distinguish retriable failures (5xx responses, timeouts) from non-retriable ones (4xx client errors), and run retries outside database transactions to avoid lock contention and deadlocks.
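The decorrelated-jitter formula and the retriable/non-retriable split can be sketched in Python as follows. The function names, the 0.5s base, and the convention that a status of `None` represents a timeout are assumptions made for illustration; the formula itself and the 5xx/4xx rule follow the text above.

```python
import random


def decorrelated_jitter(previous, base=0.5, cap=20.0, uniform=random.uniform):
    """Next delay: min(cap, random(base, previous * 3)).

    `previous` is the delay used last time; pass `base` for the first retry.
    """
    return min(cap, uniform(base, previous * 3))


def delay_schedule(retries, base=0.5, cap=20.0):
    """Delays for `retries` consecutive retries, each derived from the last."""
    delays, sleep = [], base
    for _ in range(retries):
        sleep = decorrelated_jitter(sleep, base, cap)
        delays.append(sleep)
    return delays


def is_retriable(status):
    """True for timeouts (status is None, by assumption) and 5xx responses.

    4xx client errors return False: resending the same request cannot help.
    """
    if status is None:
        return True
    return 500 <= status <= 599
```

Because each delay depends on the previous random draw instead of the attempt number, two clients that start together drift apart after the first retry.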
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:57:30.564812+00:00— report_created — created