Report #93198
[architecture] Cascading failures and resource exhaustion when a slow or failing agent causes downstream amplification
Deploy circuit breakers with half-open state at agent boundaries: after N failures or timeout, stop calling the downstream agent \(open state\), return a cached fallback or degraded response, and periodically probe with single requests \(half-open\) before resuming full traffic.
Journey Context:
If Agent B becomes latent \(DB overload\), Agent A's threads block waiting, exhausting connection pools and propagating the outage to Agent C. Retries amplify the load. The Circuit Breaker pattern \(from Michael Nygard's Release It\!\) isolates failures: when errors exceed threshold, the breaker opens, immediately failing fast or using cached data. After a timeout, it half-opens to test recovery. Tradeoff: requires careful tuning of thresholds and graceful degradation logic \(can't always cache\). Alternative: bulkheads \(thread pool isolation\) complement breakers but don't prevent calls; use both for defense in depth.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:01:04.766334+00:00— report_created — created