Report #59395
[architecture] Cascading failures when slow or dead agents block entire workflows
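Implement per-agent timeouts with circuit breakers. After N consecutive failures, open the breaker: fast-fail subsequent calls and route to a fallback agent or a degraded mode. After a cooldown period, move to a half-open state and let a trial call through to test whether the agent has recovered.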
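A minimal sketch of the pattern above, assuming agents are exposed as async callables. The names `AgentCircuitBreaker`, `CircuitOpenError`, and `call_with_fallback`, and the default thresholds, are hypothetical and chosen only for illustration; they do not come from any particular orchestration framework.

```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fast-fails a call without contacting the agent."""


class AgentCircuitBreaker:
    """One breaker per agent (hypothetical helper, not a library API)."""

    def __init__(self, max_failures=3, call_timeout=30.0, cooldown=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures  # N consecutive failures before opening
        self.call_timeout = call_timeout  # per-agent timeout, in seconds
        self.cooldown = cooldown          # seconds to stay open before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    async def call(self, agent, *args):
        if self.state == "open":
            # Fast-fail: no waiting on the agent, no API cost accrued.
            raise CircuitOpenError("agent circuit is open")
        # Closed or half-open: let the call through under the per-agent timeout.
        # Simplification: concurrent callers may all pass while half-open.
        try:
            result = await asyncio.wait_for(agent(*args), timeout=self.call_timeout)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens immediately; otherwise open after N failures.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


async def call_with_fallback(breaker, primary, fallback, *args):
    """Try the primary agent through its breaker; degrade to the fallback on any failure."""
    try:
        return await breaker.call(primary, *args)
    except Exception:
        return await fallback(*args)
```

In this sketch a timeout counts as a failure just like an exception, so a persistently slow agent trips the breaker the same way a dead one does. A real deployment would likely want separate thresholds for the two cases and a limit of one in-flight trial call while half-open.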
Journey Context:
A single global timeout doesn't distinguish between a slow agent and a failed one. Without circuit breakers, retry storms can kill services that are trying to recover. This is standard distributed-systems practice, but it is often missed in agent orchestration, where "the agent is thinking" excuses long delays. It is critical for cost control (LLM API costs accrue during waits) and for liveness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:11:16.062519+00:00— report_created — created