Report #93869
[architecture] How to design retries that don't overwhelm failing services or create thundering herds?
Implement exponential backoff with base 2 (1s, 2s, 4s, ...) capped at 60s, combined with full jitter (sleep for a uniformly random duration between 0 and the current backoff delay). Retry only idempotent operations, and open a circuit breaker after 5 consecutive failures.
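A minimal Python sketch of the backoff rule above, assuming a 1s base and 60s cap as stated in this report. The names `full_jitter_delay` and `retry_idempotent`, the `max_attempts=5` default, and the choice to retry on any exception are illustrative assumptions, not part of the original report.

```python
import random
import time

BASE_DELAY_S = 1.0   # first backoff ceiling (1s, per the report)
MAX_DELAY_S = 60.0   # cap on the backoff ceiling (60s, per the report)


def full_jitter_delay(attempt, base=BASE_DELAY_S, cap=MAX_DELAY_S, rng=random):
    """Return a sleep time drawn uniformly from [0, min(cap, base * 2**attempt)].

    `attempt` is 0 for the first retry. The exponent is clamped so very large
    attempt numbers cannot overflow a float before the cap is applied.
    """
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return rng.uniform(0.0, ceiling)


def retry_idempotent(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Call a zero-argument, idempotent `operation`, backing off between failures.

    Assumption: every exception is treated as retryable. Real code would
    usually retry only on timeouts, connection errors, or 5xx-style responses.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(full_jitter_delay(attempt, rng=rng))
```

Because each client draws its own random delay, clients that failed at the same moment spread their retries across the whole backoff window instead of returning together.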
Journey Context:
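Naive fixed-interval retries cause synchronized 'thundering herds' when a service recovers, because every client retries on the same schedule. Pure exponential backoff without jitter has the same problem at larger intervals: clients that failed together still retry together, producing periodic load spikes. AWS's published analysis of backoff strategies found that full jitter (uniform random [0, delay]) greatly reduces total client work compared with un-jittered backoff and performs at least comparably to decorrelated jitter under high contention. The critical mistake is retrying non-idempotent POST requests without idempotency keys, which causes duplicate side effects. Without circuit-breaking, clients waste resources repeatedly calling endpoints that are persistently failing.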
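The circuit-breaking step can be sketched as follows. The threshold of 5 consecutive failures comes from this report; the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the 30s cool-down before a trial call are hypothetical choices made for illustration. The sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # 5, per the report
        self.reset_timeout_s = reset_timeout_s      # assumed cool-down value
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        """Run a zero-argument `operation` unless the circuit is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success resets the failure count and closes the circuit.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Wrapping each retried call in `breaker.call(...)` lets the client fail fast while the circuit is open instead of spending its retry budget on an endpoint that is known to be down.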
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:08:46.559981+00:00— report_created — created