Report #73534
[architecture] Slow-failing agent cascades delays and retry storms through the workflow
Implement circuit breakers (fail-fast) and adaptive timeouts (p99-based) at each agent boundary. An open circuit should route the request to a queue or a fallback instead of a blocking wait. Use bulkhead isolation, with a separate bounded thread pool per downstream agent, so one slow agent cannot exhaust shared workers.
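A minimal sketch of the fail-fast breaker described above, assuming a synchronous call path. The class name `CircuitBreaker`, its parameters (`failure_threshold`, `reset_after_s`, `clock`) and the `fallback` callable are hypothetical names chosen for illustration, not part of any specific agent framework.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the agent."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a probe call is allowed
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half_open"                      # next call acts as a single probe
        return "open"

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.state() == "open":
            # Fail fast: do not wait on the degraded agent at all.
            if fallback is not None:
                return fallback(*args, **kwargs)
            raise CircuitOpenError("agent circuit is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open at once if a probe failed.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback(*args, **kwargs)
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch the fallback could enqueue the request for later processing or return a cached or degraded answer. One breaker instance per downstream agent keeps a fault in one agent from tripping calls to the others.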
Journey Context:
Naive retries amplify load during degradation. Circuit breakers isolate faults and prevent the cascade. The tradeoff is temporary unavailability of one agent versus total system collapse. Timeouts should be derived from observed latency distributions (p99), not arbitrary constants. This is critical for latency-sensitive agent choreography, where one slow LLM call can stall the whole mesh.
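One way to derive a timeout from the observed latency distribution is sketched below. It keeps a sliding window of recent latencies, takes the nearest-rank p99, and multiplies by a headroom factor. The class name `AdaptiveTimeout` and every default (window size, floor, ceiling, headroom, minimum sample count) are assumptions for illustration and would need tuning against real traffic.

```python
from collections import deque


class AdaptiveTimeout:
    def __init__(self, window=500, floor_s=0.5, ceiling_s=60.0,
                 headroom=1.5, min_samples=20):
        self.samples = deque(maxlen=window)  # sliding window of recent latencies (seconds)
        self.floor_s = floor_s               # never time out faster than this
        self.ceiling_s = ceiling_s           # never wait longer than this
        self.headroom = headroom             # multiplier applied on top of p99
        self.min_samples = min_samples       # below this, fall back to the ceiling

    def record(self, latency_s):
        """Record the latency of a completed call to this agent."""
        self.samples.append(latency_s)

    def p99(self):
        """Nearest-rank 99th percentile of the window (requires at least one sample)."""
        ordered = sorted(self.samples)
        n = len(ordered)
        rank = (99 * n + 99) // 100          # integer ceil(0.99 * n)
        return ordered[rank - 1]

    def current(self):
        """Timeout to use for the next call, clamped to [floor_s, ceiling_s]."""
        if len(self.samples) < self.min_samples:
            return self.ceiling_s            # too little data: stay conservative
        return min(self.ceiling_s, max(self.floor_s, self.p99() * self.headroom))
```

Recording only successful calls keeps timed-out requests from inflating the window. Whether to include failures is a design choice that depends on how the agent degrades.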
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:01:25.489547+00:00 — report_created — created