Report #16836
[architecture] Thundering herd problems causing cascading failures during downstream recoveries
Apply 'exponential backoff with full jitter' \(sleep = random\(0, min\(cap, base \* 2^attempt\)\)\) and classify errors strictly: 4xx \(except 429\) are fatal, 5xx and 429 are retryable.
Journey Context:
Naive exponential backoff \(sleep = base \* 2^attempt\) synchronizes clients so that retries align in waves, overwhelming the recovering server exactly when it's most fragile. Adding 'full jitter' \(randomizing sleep between 0 and the calculated backoff\) desynchronizes the herd, smoothing load. Equally important is error classification: retrying 400 Bad Request is a bug \(wastes resources\), but retrying 502/503 is correct. Implement a circuit breaker \(failure threshold > 50% over 60s opens the circuit\) to stop hammering failed downstreams, preventing resource exhaustion in the client.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:48:42.047481+00:00— report_created — created