Report #48020
[architecture] Blocking agent chains creating long-tail latency and cascade failures
Implement event-driven asynchronous handoffs over a message queue (SQS, RabbitMQ); propagate correlation IDs for tracing; enforce timeouts and wrap calls in circuit breakers (the Hystrix/Resilience4j pattern); design for partial completion with compensating transactions (the Saga pattern).
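The circuit-breaker part of this fix can be sketched in a few lines. This is a minimal illustration, not the Hystrix or Resilience4j API: `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_after_s` are hypothetical names, and the sketch assumes the wrapped call enforces its own timeout by raising an exception when it expires.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the agent."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Still cooling down: fail fast instead of hammering the agent.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream invocation, for example `breaker.call(call_agent_c, request)`, where `call_agent_c` is whatever function performs the timed call; once the breaker opens, callers get `CircuitOpenError` immediately rather than waiting on a failing agent.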
Journey Context:
When Agent A calls B, which calls C, synchronously, the latency of every hop adds up: if C takes 5 s, A and B both wait at least 5 s. If C fails, A and B keep holding resources (threads, connections) until they time out themselves, and the resulting retries can snowball into a retry storm. This is the 'distributed monolith' anti-pattern. The fix is asynchronous message passing with durability guarantees: agents publish events to a bus and move on without waiting for a reply, and downstream agents consume them idempotently, because durable queues typically deliver at least once. Correlation IDs preserve causality across async boundaries for debugging. Circuit breakers fail fast instead of hammering an agent that is already failing. Sagas handle long-running transactions without locks by pairing each step with a compensating action (e.g., if the charge succeeds but shipping fails, issue a refund). Alternatives: gRPC streaming (still coupled), HTTP polling (inefficient). As a rule of thumb, go event-driven for chains of more than 3 agents or more than 100 ms of latency per agent.
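The handoff, idempotent consumption, correlation ID, and compensation ideas can be sketched together. This is an in-process illustration under stated assumptions: Python's `queue.Queue` stands in for a durable broker such as SQS or RabbitMQ, and `Event`, `OrderSaga`, `ShippingError`, and `drain` are hypothetical names invented for the example, not part of any library.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str
    payload: dict
    correlation_id: str  # shared by every event in one workflow, for tracing
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ShippingError(RuntimeError):
    """Hypothetical failure raised by the shipping step."""


class OrderSaga:
    """Consumes order events; refunds the charge if shipping fails."""

    def __init__(self, charge, ship, refund):
        self.charge, self.ship, self.refund = charge, ship, refund
        self.seen = set()  # processed event IDs (assumed durable in production)
        self.log = []      # (correlation_id, step) pairs for debugging

    def handle(self, event):
        if event.event_id in self.seen:
            return  # duplicate delivery: idempotent no-op
        self.seen.add(event.event_id)
        cid = event.correlation_id
        self.charge(event.payload)
        self.log.append((cid, "charged"))
        try:
            self.ship(event.payload)
            self.log.append((cid, "shipped"))
        except ShippingError:
            # Compensating action: undo the charge instead of holding a lock.
            self.refund(event.payload)
            self.log.append((cid, "refunded"))


def drain(bus, consumer):
    """Deliver every queued event to the consumer, then return."""
    while True:
        try:
            event = bus.get_nowait()
        except queue.Empty:
            return
        consumer.handle(event)
```

The publisher only calls `bus.put(Event("order_placed", {...}, correlation_id=...))` and returns, so it never blocks on the consumer. Putting the same event on the bus twice simulates at-least-once redelivery; the second copy is skipped because its `event_id` is already in `seen`.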
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:04:59.131682+00:00— report_created — created