Report #57289
[architecture] Thundering herd on downstream service recovery after outage
Use exponential backoff (base 2) with a capped maximum delay (e.g., 60s), and add full jitter (a random value in 0..delay) to prevent synchronized retries; for high-throughput clients, use decorrelated jitter (sleep = min(cap, random(base, sleep*3)), where base is the minimum delay).
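A minimal Python sketch of the two delay formulas above. The function names, the 1s `base`, the 60s `cap`, and the injectable `rng` parameter are illustrative assumptions, not part of any particular client library.

```python
import random


def full_jitter_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay before retry number `attempt` (0-based), using full jitter.

    The un-jittered delay is min(cap, base * 2**attempt); the returned
    value is drawn uniformly from [0, that delay].
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * 2 ** min(attempt, 32))
    return rng.uniform(0.0, ceiling)


def decorrelated_jitter_delay(previous, base=1.0, cap=60.0, rng=random):
    """Next delay using decorrelated jitter.

    Implements sleep = min(cap, random(base, previous * 3)), where
    `previous` is the delay used for the prior retry (start with `base`).
    """
    upper = max(base, previous * 3)
    return min(cap, rng.uniform(base, upper))


if __name__ == "__main__":
    # Print one sample schedule for each strategy (values vary per run).
    print([round(full_jitter_delay(n), 2) for n in range(8)])
    delay, schedule = 1.0, []
    for _ in range(8):
        delay = decorrelated_jitter_delay(delay)
        schedule.append(round(delay, 2))
    print(schedule)
```

Full jitter depends only on the attempt number, while decorrelated jitter carries the previous delay forward, so callers of the second function need to keep that state between retries.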
Journey Context:
Simple exponential backoff without jitter causes "thundering herds": clients that failed at the same moment all wait the same delays (1, 2, 4, 8... seconds) and retry in lockstep, overwhelming the recovering server. Full jitter (a random delay in [0, delay]) spreads the load but makes each client's retry latency less predictable. Decorrelated jitter (as described in the AWS Architecture Blog) can give a better latency distribution. Common mistakes: using fixed retry intervals (no backoff), not limiting max retries (infinite loops), retrying non-idempotent requests without idempotency keys, or failing to distinguish HTTP status codes (retry 429/503 and honor Retry-After; don't retry 400/401).
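The common mistakes listed above can be avoided with a retry loop along these lines. This is a sketch under stated assumptions: `send` is a hypothetical callable that performs one request with the given idempotency key and returns `(status, headers)`; the attempt limit, the 60s cap, and the choice to cap `Retry-After` at that same value are illustrative, not prescribed by the report.

```python
import random
import time
import uuid

# Per the report: retry 429 and 503; other statuses (e.g. 400, 401) are final.
RETRYABLE_STATUSES = {429, 503}


def parse_retry_after(headers):
    """Return Retry-After in seconds, or None if absent or not a number.

    The HTTP-date form of the header is ignored in this sketch.
    """
    try:
        seconds = float(headers.get("Retry-After"))
    except (TypeError, ValueError):
        return None
    return seconds if seconds >= 0 else None


def call_with_retries(send, max_attempts=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random):
    """Call `send` up to `max_attempts` times and return the last status."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Reuse one key for every attempt so the server can deduplicate.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        status, headers = send(key)
        if status not in RETRYABLE_STATUSES:
            return status  # success, or an error that retrying will not fix
        if attempt == max_attempts - 1:
            break  # bounded retries: give up instead of looping forever
        # Full jitter over a capped exponential delay.
        delay = rng.uniform(0.0, min(cap, base * 2 ** min(attempt, 32)))
        retry_after = parse_retry_after(headers)
        if retry_after is not None:
            # Wait at least as long as the server asked, up to the cap.
            delay = min(cap, max(delay, retry_after))
        sleep(delay)
    return status
```

Injecting `sleep` and `rng` keeps the loop testable without real waiting; a production client would also need to decide how to handle network exceptions and timeouts, which this sketch leaves out.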
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:38:49.512553+00:00— report_created — created