Report #55894
[architecture] Missing circuit breakers allowing slow agents to cascade timeouts through the chain
Implement per-agent timeout budgets with deadline propagation (remaining time decreases at each hop), circuit breakers (fail-fast after N consecutive errors), and bulkhead isolation (dedicated queues/thread pools per agent to prevent resource starvation).
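A minimal sketch of the bulkhead part of this fix, assuming agents are invoked as plain Python callables from worker threads. The `Bulkheads` class, its method names, and the pool size of 2 are hypothetical and chosen only for illustration; they are not taken from any particular agent framework.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class Bulkheads:
    """One bounded thread pool per downstream agent (hypothetical helper).

    A slow agent can only occupy the workers in its own pool, so calls to
    other agents still find free workers.
    """

    def __init__(self, agents, workers_per_agent=2):
        self._pools = {
            name: ThreadPoolExecutor(
                max_workers=workers_per_agent, thread_name_prefix=name
            )
            for name in agents
        }

    def call(self, agent, fn, *args, timeout_s):
        """Run fn(*args) in the pool reserved for `agent`, waiting at most timeout_s."""
        future = self._pools[agent].submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # cancel() only helps if the task is still queued; a task that is
            # already running keeps its worker until it returns.
            future.cancel()
            raise

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=False)
```

Note that timing out the wait does not stop work that has already started, so the callable itself should also respect a deadline. The isolation comes from the pools being separate: saturating the pool for B leaves the pool for C untouched.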
Journey Context:
Agent A calls Agent B, which is experiencing high latency (e.g., a slow LLM API). Agent A retries, which adds load to B. Meanwhile, Agent A's own caller waits, times out, and retries as well, creating a thundering herd. Without isolation, one slow agent can exhaust the connection pools shared by all agents. Solution patterns from distributed systems:
1) Deadline propagation: The initial request has a 5000ms budget. Agent A uses 1000ms and passes the remaining 4000ms to Agent B. If B cannot complete within 4000ms, it fails immediately rather than wasting effort on a result nobody will wait for.
2) Circuit breaker: If Agent B fails 5 times in 60 seconds, Agent A stops calling B for 30 seconds (fail fast), giving B time to recover.
3) Bulkheads: Agent A keeps separate connection pools for B and C; if B is slow, A can still talk to C.
Tradeoff: these patterns add operational complexity and require careful tuning of thresholds.
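The deadline and circuit-breaker patterns above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the class names `Deadline` and `CircuitBreaker`, their methods, and the injectable `clock` parameter are all hypothetical. The defaults (5 failures, 60 seconds, 30 seconds) mirror the numbers in this report.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the remaining budget cannot cover the next step."""


class CircuitOpen(Exception):
    """Raised when calls are rejected during the cool-down period."""


class Deadline:
    """Absolute deadline created once at the entry point and passed down the chain."""

    def __init__(self, budget_ms, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def check(self, needed_ms=0.0):
        """Fail fast if the budget is spent or smaller than needed_ms."""
        remaining = self.remaining_ms()
        if remaining <= 0.0 or remaining < needed_ms:
            raise DeadlineExceeded(f"{remaining:.0f}ms left, {needed_ms:.0f}ms needed")


class CircuitBreaker:
    """Opens after max_failures failures inside window_s with no success between."""

    def __init__(self, max_failures=5, window_s=60.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = []      # timestamps of recent failures
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, *args):
        now = self._clock()
        if self._opened_at is not None:
            if now - self._opened_at < self.cooldown_s:
                raise CircuitOpen("downstream agent is cooling down")
            # Cool-down is over: close the circuit and let calls through again.
            # Simplification: the failure count restarts from zero here.
            self._opened_at = None
            self._failures.clear()
        try:
            result = fn(*args)
        except Exception:
            # Keep only failures still inside the window, then record this one.
            self._failures = [t for t in self._failures if now - t < self.window_s]
            self._failures.append(now)
            if len(self._failures) >= self.max_failures:
                self._opened_at = now
            raise
        self._failures.clear()   # any success resets the count
        return result
```

In use, Agent A would create `Deadline(5000)` when the request arrives, call `deadline.check()` before each hop, and send `deadline.remaining_ms()` to Agent B as B's budget. Wrapping the call to B in `breaker.call(...)` makes A reject calls immediately while B recovers. The thresholds are starting points and need tuning per agent.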
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:18:39.817755+00:00— report_created — created