Report #46631
[architecture] How do I prevent a slow downstream service from causing thread pool exhaustion and cascading failure across my entire application?
Wrap every external call in a circuit breaker configured with a failure threshold (e.g., 50% errors over 60 seconds), a timeout shorter than the upstream client's deadline, and a half-open state that probes recovery with a single request. In addition, isolate external calls with the bulkhead pattern (a dedicated thread pool per dependency) so that a slow dependency cannot starve the main application pool.
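A minimal sketch of the circuit breaker described above, not a production implementation. The class name `CircuitBreaker`, the exception `CircuitOpenError`, and all parameter names and defaults (`min_calls`, `cooldown_seconds`, and so on) are hypothetical choices for illustration. It assumes the wrapped function enforces its own timeout and raises an exception when that timeout or any other failure occurs.

```python
import threading
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_seconds=60.0, min_calls=10,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio      # e.g. 0.5 means 50% errors
        self.window_seconds = window_seconds    # sliding window for the error rate
        self.min_calls = min_calls              # do not trip on a tiny sample
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                      # injectable so tests can fake time
        self.state = "closed"
        self.opened_at = 0.0
        self.results = deque()                  # (timestamp, succeeded) pairs
        self._lock = threading.Lock()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()

    def _record(self, succeeded):
        if self.state != "closed":
            return  # late results from calls started before the breaker tripped
        now = self.clock()
        self.results.append((now, succeeded))
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) >= self.failure_ratio):
            self._trip()

    def call(self, func, *args, **kwargs):
        with self._lock:
            probing = False
            if self.state == "open":
                if self.clock() - self.opened_at < self.cooldown_seconds:
                    raise CircuitOpenError("failing fast while dependency cools down")
                self.state = "half_open"  # cooldown elapsed: let one probe through
                probing = True
            elif self.state == "half_open":
                raise CircuitOpenError("a recovery probe is already in flight")
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self._lock:
                if probing:
                    self._trip()          # probe failed: reopen and restart cooldown
                else:
                    self._record(False)
            raise
        with self._lock:
            if probing:
                self.state = "closed"     # probe succeeded: resume normal traffic
                self.results.clear()
            else:
                self._record(True)
        return result
```

A caller would typically catch `CircuitOpenError` and return a cached value or a fast error instead of waiting on the dependency.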
Journey Context:
Without circuit breakers, when Service B slows down (for example, under database pressure), Service A's threads block waiting for B. Once all threads are blocked, Service A stops responding to health checks, the load balancer removes that instance from rotation, and traffic shifts to the remaining A instances, which also saturate and fail: a cascading collapse. Timeouts alone are insufficient because each call still holds a thread until its timeout fires. The circuit breaker monitors the error rate; when the threshold is exceeded, it opens and fails fast (returning an error or a cached response), giving B time to recover under reduced load. After a cooldown, it enters the half-open state and allows one request through to test whether B has recovered. If that request succeeds, the breaker closes; if it fails, the breaker reopens. Bulkheads (thread pool isolation) are critical: reserve a separate small pool for each dependency so that even if calls to B consume every thread allocated to B, Service A still has threads for other operations and health checks.
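The bulkhead idea above can be sketched as one small, dedicated thread pool per dependency that rejects work when it is full rather than queueing it. This is an illustration under stated assumptions: `Bulkhead`, `BulkheadFullError`, and `max_concurrent` are hypothetical names, and the semaphore is added because Python's `ThreadPoolExecutor` would otherwise queue excess work without limit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when every thread reserved for this dependency is busy."""


class Bulkhead:
    def __init__(self, name, max_concurrent=4):
        # Dedicated pool: calls to this dependency never use the main pool.
        self._pool = ThreadPoolExecutor(
            max_workers=max_concurrent,
            thread_name_prefix=f"bulkhead-{name}",
        )
        # Caps in-flight calls so excess work is rejected instead of queued.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, func, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("dependency pool is saturated; failing fast")
        try:
            future = self._pool.submit(func, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the call finishes, fails, or is cancelled.
        future.add_done_callback(lambda _future: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

With one `Bulkhead` per dependency, a slow Service B can occupy at most its own `max_concurrent` threads. The submitted function should still carry its own timeout, because waiting on the future with `future.result(timeout=...)` stops the caller from waiting but does not free the worker thread.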
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:44:47.709749+00:00— report_created — created