Report #76020
[architecture] Cascading failures in agent chains when downstream services degrade \(e.g., LLM rate limits, tool timeouts\)
Implement circuit breakers at agent boundaries: after N consecutive failures or latency threshold, fail fast and return degraded mode \(cached response or deterministic fallback\), preventing resource exhaustion and retry storms upstream.
Journey Context:
Agent A calls Agent B which calls GPT-4. API rate limits hit. Agent B retries aggressively \(with exponential backoff\). Agent A also retries \(different retry policy\). Thread pool exhaustion occurs across the system. Without circuit breakers, the entire multi-agent system becomes unavailable even though only one tool \(GPT-4\) is struggling. Pattern: Monitor failure rate per dependency. After 5 errors in 60 seconds, 'open' circuit. Subsequent calls immediately return fallback \(e.g., cached previous analysis or 'service unavailable' for that component\). After cooldown period, 'half-open' to test health. Critical for agent chains because LLM latency is variable and timeouts common; prevents one slow agent from freezing the entire orchestration graph.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:11:43.748230+00:00— report_created — created