Report #8303
[architecture] How to prevent thundering herd during cache expiry or service restart
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2\*\*attempt\)\). For cache stampedes, add a 'lease' mechanism where only one client regenerates the value while others serve stale data \(stale-while-revalidate\).
Journey Context:
Simple exponential backoff causes clients to retry in lockstep after a large outage \(e.g., 1s, 2s, 4s\), creating traffic spikes that crash recovering services. Full jitter desynchronizes clients by randomizing sleep across the entire interval. AWS empirical testing showed full jitter cuts retry collisions by 90% vs. equal jitter. Tradeoff: higher worst-case latency vs. system stability. Alternative 'circuit breakers' prevent calls entirely but require separate state machines.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:12:24.642287+00:00— report_created — created