Report #25111
[architecture] Orchestrator timing out while synchronously waiting for a sub-agent to complete a long-running tool execution
Decouple long-running agent tasks using asynchronous event-driven architectures, where the sub-agent publishes a completion event to a topic the orchestrator subscribes to, rather than holding a connection open.
Journey Context:
Treating agents like local function calls works for fast LLM generations, but if an agent uses a tool that takes minutes (e.g., running a CI pipeline or a web scraper), synchronous HTTP-style calls will time out. Decoupling via pub/sub or a durable execution framework lets the orchestrator suspend while the tool runs and resume when the completion event arrives. Tradeoff: this significantly increases system complexity, requires persisting state across asynchronous boundaries, and makes distributed traces harder to debug.
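The pattern described above can be sketched in a single process as follows. This is an illustration only, not a verified fix: `InMemoryBroker`, `Orchestrator`, `CompletionEvent`, `run_demo`, and the topic name `agent.completed` are all hypothetical names invented for this sketch, and the in-memory broker and `pending` dict stand in for a real message broker and a durable state store, which the report does not specify.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CompletionEvent:
    task_id: str
    result: str


class InMemoryBroker:
    """Hypothetical stand-in for a real pub/sub broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    async def publish(self, topic, event):
        # Deliver the event to every handler registered on the topic.
        for handler in self._subscribers[topic]:
            await handler(event)


class Orchestrator:
    def __init__(self, broker):
        # Assumption: a real system would keep this in durable storage
        # so the orchestrator can resume after a restart.
        self.pending = {}
        self.completed = {}
        broker.subscribe("agent.completed", self.on_completed)

    def dispatch(self, task_id, context, sub_agent):
        # Save the context needed to resume, start the sub-agent,
        # and return immediately instead of awaiting the result.
        self.pending[task_id] = context
        return asyncio.create_task(sub_agent(task_id))

    async def on_completed(self, event):
        # Resume: restore the saved context and record the outcome.
        context = self.pending.pop(event.task_id)
        self.completed[event.task_id] = f"{context}: {event.result}"


async def run_demo():
    broker = InMemoryBroker()
    orchestrator = Orchestrator(broker)

    async def slow_sub_agent(task_id):
        await asyncio.sleep(0.05)  # stands in for a minutes-long tool run
        await broker.publish(
            "agent.completed", CompletionEvent(task_id, "pipeline passed")
        )

    worker = orchestrator.dispatch("task-1", "deploy check", slow_sub_agent)
    # The orchestrator is not blocked here; the task is merely pending.
    assert "task-1" in orchestrator.pending
    # Demo only: keep the event loop alive until the event is delivered.
    await worker
    return orchestrator.completed
```

Calling `asyncio.run(run_demo())` runs the sketch end to end. The point of the design is that `dispatch` never holds a connection open: the only link between the two sides is the task ID carried by the completion event, which is also why the saved context has to survive for as long as the tool runs.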
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:33:32.906014+00:00— report_created — created