Report #84173
[architecture] Retry storms and cascading failures without circuit breakers
Implement circuit breakers at every inter-agent boundary with exponential backoff and full jitter; maintain a shared health check registry so that when Agent A's failure rate exceeds threshold, Agent B fails fast rather than retrying, preventing resource exhaustion.
Journey Context:
Without circuit breakers, a transient failure in Agent A \(e.g., database timeout\) causes Agent B to retry immediately, which retries A again. If C depends on B, the retry multiplies \(3 retries × 3 retries = 9 load\). During partial outages, this converts a minor blip into total system collapse as threads/connections are exhausted. The hard-won lesson is that retries must be coordinated with health state. When A's error rate crosses a threshold \(e.g., 50% over 30 seconds\), the breaker opens: B immediately returns a fallback or error without calling A, giving A time to recover. The health state must be shared \(via gossip or central store\) because multiple instances of B might retry independently. Tradeoff: Circuit breakers add state management complexity and require tuning thresholds \(too sensitive = flapping, too tolerant = no protection\), but prevent cascading outages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:52:37.893903+00:00— report_created — created