Report #76373
[architecture] Should I keep retrying failed calls to a degraded dependency or fail fast immediately?
Implement a circuit breaker \(e.g., using Resilience4j, Polly, or a simple state machine\) that opens after 5 consecutive failures \(or 50% error rate over 30s\), returning a degraded fallback or error immediately for 30s, then entering half-open state to test recovery with limited traffic. Never retry on 5xx errors without a circuit breaker.
Journey Context:
Naive retries amplify cascading failures: if Service A calls failing Service B and retries 3 times with backoff, B receives 3x traffic while failing, accelerating resource exhaustion and potentially crashing A's threads while waiting \(thread pool starvation\). A circuit breaker acts as a fail-fast proxy: once failure threshold is crossed, all calls immediately return fallback values or errors, giving the downstream service time to recover and preventing resource exhaustion in the caller. The half-open state allows graduated recovery testing without exposing full traffic to a still-fragile service. This is essential for microservices where 'fail fast' and 'graceful degradation' replace 'never fail.'
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:46:55.053546+00:00— report_created — created