Report #59134
[architecture] Retry storms and cascading latency during service overload with unbounded client retries and queue buffering
Implement server-side load shedding: return HTTP 503 Service Unavailable or 429 Too Many Requests with a Retry-After header when capacity thresholds (CPU, memory, in-flight requests) are breached; clients must use circuit breakers (fail-fast after N errors) and exponential backoff with full jitter; enforce bounded queue sizes (drop-tail or drop-head) to prevent bufferbloat instead of unbounded buffering.
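The server-side half of this fix can be sketched as follows. This is a minimal illustration, not a production implementation: the names `LoadShedder` and `BoundedQueue`, the in-flight limit, and the Retry-After value are all hypothetical, and a real service would likely also consult CPU, memory, or latency signals.

```python
import threading
from collections import deque


class LoadShedder:
    """Admit a request only while the in-flight count is under a fixed limit.

    Hypothetical helper: the limit and Retry-After value are assumptions.
    """

    def __init__(self, max_in_flight, retry_after_seconds=1):
        self.max_in_flight = max_in_flight
        self.retry_after_seconds = retry_after_seconds
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # at capacity: caller should shed the request
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1

    def handle(self, work):
        """Return (status, headers, body); reject with 503 when at capacity."""
        if not self.try_acquire():
            return 503, {"Retry-After": str(self.retry_after_seconds)}, None
        try:
            return 200, {}, work()
        finally:
            self.release()


class BoundedQueue:
    """Fixed-capacity queue with a drop-tail or drop-head overflow policy."""

    def __init__(self, capacity, policy="drop-tail"):
        if policy not in ("drop-tail", "drop-head"):
            raise ValueError("policy must be 'drop-tail' or 'drop-head'")
        self.capacity = capacity
        self.policy = policy
        self.items = deque()

    def offer(self, item):
        """Try to enqueue; return False if the new item was dropped."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        if self.policy == "drop-head":
            self.items.popleft()  # evict the oldest item to make room
            self.items.append(item)
            return True
        return False  # drop-tail: reject the newcomer
```

Rejecting before any work starts is the point of the sketch: a shed request costs one lock acquisition, whereas a queued request that later times out costs its full processing time.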
Journey Context:
When a service is overloaded, the intuitive response is to buffer requests in longer queues or to have clients retry aggressively. Both backfire, producing a 'retry storm' and 'bufferbloat': latency spikes because requests sit in long queues behind slow processing, clients time out and retry, and the retries add load to a service that is already failing. This can destabilize the entire cluster.
The correct approach is graceful load shedding: the server must reject work it cannot handle. Google SRE practice is to handle overload by rejecting requests at the reverse proxy or service entry point as soon as capacity metrics (CPU, in-flight requests, memory, or p99 latency) breach predefined thresholds. Return 503 Service Unavailable (the server is temporarily overloaded) or 429 Too Many Requests, ideally with a Retry-After header telling the client when to retry. This stops the server from wasting work on requests that will time out anyway.
On the client side, implement circuit breakers (for example Hystrix, Resilience4j, or Envoy's outlier detection): after N consecutive errors or a high error rate, the circuit opens and fails fast (returns an error immediately) for a cooldown period, which prevents retry storms. When retrying, use exponential backoff with full jitter (randomization) to desynchronize retries across thousands of clients.
The architectural insight is that unbounded queues are harmful. Use bounded queues (drop new requests when full, or drop the oldest) or, better yet, shed load immediately at the edge. This keeps latency low for the requests that are accepted and prevents cascading failures. It is counterintuitive because 'dropping work' feels wrong, but during an incident it is essential for protecting core system capacity.
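The client-side behaviour described above might look roughly like this. It is a sketch under stated assumptions: `full_jitter_delay`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names, the thresholds are arbitrary, the breaker counts consecutive failures only (no error-rate window), and it is not thread-safe. Libraries such as Resilience4j offer more complete implementations.

```python
import random
import time


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Backoff delay in seconds, drawn uniformly from
    [0, min(cap, base * 2**attempt)). `attempt` starts at 0."""
    return rng() * min(cap, base * (2 ** attempt))


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused."""


class CircuitBreaker:
    """Open after N consecutive failures, fail fast during a cooldown,
    then allow a trial call (half-open). Hypothetical, single-threaded."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            elapsed = self._clock() - self._opened_at
            if elapsed < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            was_trial = self._opened_at is not None
            if was_trial or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

A caller would typically wrap each outbound request in `breaker.call(...)`, sleep for `full_jitter_delay(attempt)` between retries, and stop retrying as soon as `CircuitOpenError` is raised. When the server supplies a Retry-After value, waiting at least that long is a reasonable choice.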
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:44:37.083286+00:00— report_created — created