Report #72047
[architecture] Preventing cascading overload when downstream latency spikes
Implement load shedding at the entry point (API gateway or load balancer): use bounded queues with small, fixed capacities (e.g., 10-100x the concurrency limit) and reject immediately (HTTP 503/429) when they are full, rather than relying on unbounded queues or waiting for autoscaling to catch up. Prefer admission control based on resource utilization (CPU/memory) over static rate limits.
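A minimal sketch of the bounded-queue idea, assuming a single-process Python service. The names `BoundedIntake`, `offer`, `take`, and `admit_or_reject` are hypothetical and not taken from any gateway product; a real gateway or load balancer would expose this through its own configuration.

```python
import queue
from typing import Any, Optional

REJECT_STATUS = 503  # 429 is the other status the report mentions


class BoundedIntake:
    """Fixed-capacity request queue that refuses new work when full."""

    def __init__(self, capacity: int) -> None:
        # queue.Queue treats maxsize <= 0 as unbounded, which is exactly
        # what load shedding must avoid, so refuse such values up front.
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._pending: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)

    def offer(self, request: Any) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def take(self) -> Any:
        """Remove the oldest queued request (raises queue.Empty if none)."""
        return self._pending.get_nowait()


def admit_or_reject(intake: BoundedIntake, request: Any) -> Optional[int]:
    """Return None if the request was queued, or a rejection status if shed."""
    return None if intake.offer(request) else REJECT_STATUS
```

The key design choice is `put_nowait`: a blocking `put` would hold the caller (and its connection) while waiting for space, which is the behavior the report warns against.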
Journey Context:
Autoscaling takes minutes, which is too slow for a sudden traffic flood, while unbounded queues cause memory exhaustion and tail-latency blowup (queueing theory: latency grows non-linearly with utilization, roughly as 1/(1 - utilization) for a simple single-server queue, so it rises steeply near saturation). The 'Handling Overload' SRE principle holds that rejecting requests early (fail fast) preserves system stability and lets clients retry with backoff, whereas slow processing leads to cascading timeouts and retry storms. Circuit breakers protect downstream dependencies, but load shedding protects the local service itself. The critical error is confusing 'queueing for later' (async) with 'holding HTTP connections open' (sync); the latter ties up threads.
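To put rough numbers on the queueing claim, here is a small sketch under the simplest textbook model, an M/M/1 queue. That model is an assumption (the report does not name one), and the function name `mm1_mean_latency` is hypothetical. In M/M/1, mean time in the system is the service time divided by (1 - utilization).

```python
def mm1_mean_latency(service_time_s: float, utilization: float) -> float:
    """Mean time in system (queueing + service) for an M/M/1 queue.

    Assumes Poisson arrivals, exponential service times, and one server;
    real services only approximate this, so treat results as a rough guide.
    """
    if not 0.0 <= utilization < 1.0:
        # At or above 100% utilization the queue grows without bound.
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # Illustrative 10 ms service time at increasing utilization levels.
    for rho in (0.5, 0.9, 0.99):
        latency_ms = mm1_mean_latency(0.010, rho) * 1000
        print(f"utilization={rho:.2f} mean latency ~{latency_ms:.0f} ms")
```

With a 10 ms service time this gives roughly 20 ms at 50% utilization, 100 ms at 90%, and 1 s at 99%, which is why a queue that is allowed to keep filling turns a latency spike into timeouts.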
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:30:50.728409+00:00— report_created — created