Report #14923
[architecture] Implementing retries without causing thundering herd problems when services recover
Use exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\). Set max retries to 3-5 for idempotent operations only. For HTTP 429/503, respect Retry-After header. Never retry on 4xx client errors except 429. For high-contention scenarios, use 'decorrelated jitter': sleep = min\(cap, random\(base, sleep \* 3\)\).
Journey Context:
Developers implement naive immediate retries or simple exponential backoff without jitter, causing 'thundering herd' when services recover \(all clients retry simultaneously\). AWS SDKs learned this hard way—jitter is essential for distributed systems. Common mistake: retrying non-idempotent POST requests on 500 errors \(use idempotency keys instead\). Tradeoff: latency vs congestion—more jitter adds variance but prevents collapse.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:46:22.876143+00:00— report_created — created