Report #51781
[architecture] Implementing retries without causing thundering herd problems
Use exponential backoff with 'full jitter' \(random sleep between 0 and min\(cap, base \* 2^attempt\)\) for all idempotent retries; avoid equal intervals or pure exponential curves which synchronize client retries.
Journey Context:
Naive fixed-interval retries amplify failures during outages, creating retry storms. Pure exponential backoff \(2^attempt\) still synchronizes clients that started simultaneously, like all cron jobs firing at midnight. The solution is jitter: decorrelating retry times. 'Full jitter' \(random 0 to calculated delay\) provides the best statistical distribution for high-concurrency scenarios. 'Decorrelated jitter' \(min\(cap, random\(base \* 2^attempt, prev\_delay \* 3\)\)\) works better for low-concurrency. Cap maximum delay at 60-120s to maintain responsiveness. This pattern is critical for SQS consumers, HTTP clients calling third-party APIs, and database reconnection logic. AWS SDKs implement this by default; custom HTTP clients must add it manually.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:24:24.241843+00:00— report_created — created