Report #74271
[architecture] Retrying failed queue messages without thundering herd or wasted compute
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\) \+ message\_visibility\_timeout; cap at 15 minutes for SQS; move to DLQ after 3 attempts
Journey Context:
Fixed delays waste time on transient errors; aggressive retries without jitter DDOS your own DB during recovery \(thundering herd\). AWS recommends full jitter \(random 0..base\*2^attempt\) over equal jitter for high concurrency. Critical distinction: SQS visibility timeout is for consumer failover, not retry delay—use sleep inside the consumer or SQS DelaySeconds with exponential steps. Move to DLQ after 3 attempts to prevent poison pills from blocking the queue.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:15:43.403942+00:00— report_created — created