Report #17776
[architecture] Preventing cascading failures when calling external services
Wrap all synchronous cross-service HTTP calls in a circuit breaker (e.g., Hystrix, Resilience4j, or a simple state machine with CLOSED/OPEN/HALF_OPEN states) and give every call an explicit timeout. The breaker opens once failures cross a threshold (e.g., a 50% error rate over 10 seconds) and then fails subsequent requests fast for a cooldown period (e.g., 30s) so the downstream service can recover. If strong consistency is required, keep the call synchronous behind the breaker. If eventual consistency is acceptable, switch to an async queue (e.g., SQS or RabbitMQ) with a dead-letter queue (DLQ) for poison messages; this removes the need for a circuit breaker on that path but requires idempotent consumers.
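The "simple state machine" option can be sketched in Python as follows. This is a minimal illustration, not the API of Hystrix or Resilience4j: the class name, the `min_calls` guard, and the injectable `clock` are assumptions added for the example. It is not thread-safe and counts every exception as a failure.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, error_rate=0.5, window_s=10.0, cooldown_s=30.0,
                 min_calls=5, clock=time.monotonic):
        self.error_rate = error_rate    # e.g. 50% of calls failing
        self.window_s = window_s        # rolling window, e.g. 10 seconds
        self.cooldown_s = cooldown_s    # time spent OPEN, e.g. 30 seconds
        self.min_calls = min_calls      # assumption: do not trip on tiny samples
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.results = deque()          # (timestamp, succeeded) pairs

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.results.clear()

    def _record(self, succeeded):
        now = self.clock()
        self.results.append((now, succeeded))
        # Drop outcomes that have aged out of the rolling window.
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) >= self.error_rate):
            self._trip()

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = State.HALF_OPEN    # cooldown over: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            if self.state is State.HALF_OPEN:
                self._trip()                # probe failed: reopen and wait again
            else:
                self._record(False)
            raise
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED       # probe succeeded: resume normal traffic
            self.results.clear()
        else:
            self._record(True)
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical function that performs the HTTP call with its own timeout. A production version would also need locking so that only one probe runs while the breaker is half-open.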
Journey Context:
Developers often make synchronous calls to other services without timeouts or circuit breakers, assuming the network is reliable. When the downstream service slows down (rather than failing outright), the caller's threads block waiting for responses, the thread and connection pools run dry, and the caller itself starts failing: a cascading failure. The underlying mistake is conflating "service down", which fails fast, with "service slow", which ties up resources until they are exhausted. The circuit breaker pattern tracks the failure rate and short-circuits calls so they fail fast, which prevents that exhaustion. The hard-won nuance is that circuit breakers apply to synchronous calls only. With async messaging you don't need one, because the queue acts as a natural buffer, but you must route poison messages to a DLQ. The half-open state is also critical: after the cooldown, the breaker lets a single test request through to check recovery before closing the circuit, which prevents flapping. Many implementations omit the half-open state, or fail to distinguish business-logic errors (4xx), which should not count toward the breaker, from infrastructure errors (5xx or timeouts), which should.
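The 4xx versus 5xx/timeout distinction can be expressed as a small predicate that a breaker consults before counting a failure. The sketch below assumes the call is made with Python's standard `urllib`; the function name `counts_toward_breaker` is hypothetical, and other HTTP clients raise different exception types.

```python
import socket
import urllib.error


def counts_toward_breaker(exc: BaseException) -> bool:
    """Return True for infrastructure failures, False for business-logic errors."""
    if isinstance(exc, urllib.error.HTTPError):
        # HTTPError subclasses URLError, so it has to be checked first.
        # 5xx means the downstream service is unhealthy; 4xx means the
        # request itself was rejected, which says nothing about its health.
        return exc.code >= 500
    if isinstance(exc, (urllib.error.URLError, socket.timeout,
                        TimeoutError, ConnectionError)):
        # DNS failures, refused connections, and timeouts.
        return True
    # Anything else (e.g. a ValueError in local code) is not a network failure.
    return False
```

Whether a status such as 429 should count is a judgment call this sketch does not settle; here it is treated like any other 4xx.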
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:20:37.496795+00:00— report_created — created