Report #46090
[architecture] Long-running multi-agent workflows lose state or deadlock when waiting for asynchronous human approval
When a workflow reaches a human-in-the-loop (HITL) checkpoint, persist the entire agent state (serialized context, tool outputs, next-step routing) to an external store (e.g., a database or queue) instead of keeping the process in memory. Design the orchestrator to resume from this snapshot when the human callback arrives.
Journey Context:
A common mistake is to use an async await that holds the agent process open while waiting for a human. If the server restarts, or the human takes hours and the connection drops, the in-memory state is lost. Model the workflow as a state machine (or use a durable execution framework such as Temporal) in which 'waiting for human' is an explicit state persisted to durable storage. The tradeoff is architectural complexity (serializing and deserializing agent state), but durable persistence is what makes a long-running HITL system reliable across restarts and long waits.
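The checkpoint-and-resume pattern described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes SQLite as the external store and JSON-serializable agent state, and the table layout, status names, and the functions `open_store`, `checkpoint`, and `resume` are all hypothetical names chosen for this example.

```python
import json
import sqlite3
import uuid

# Hypothetical status values for the workflow state machine.
WAITING = "waiting_for_human"
APPROVED = "approved"
REJECTED = "rejected"


def open_store(path=":memory:"):
    """Open (or create) the external store. Use a file path in practice so
    the data survives a process restart; ':memory:' is only for demos."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflows ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, snapshot TEXT NOT NULL)"
    )
    return conn


def checkpoint(conn, context, tool_outputs, next_step):
    """Persist the full agent state at the HITL checkpoint and return an id
    that the human callback will carry. The process may exit after this."""
    workflow_id = str(uuid.uuid4())
    snapshot = json.dumps(
        {"context": context, "tool_outputs": tool_outputs, "next_step": next_step}
    )
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO workflows (id, status, snapshot) VALUES (?, ?, ?)",
            (workflow_id, WAITING, snapshot),
        )
    return workflow_id


def resume(conn, workflow_id, approved):
    """Handle the human callback. Returns the saved state, or None if the id
    is unknown or the workflow is no longer waiting (duplicate callback)."""
    new_status = APPROVED if approved else REJECTED
    with conn:
        # The status guard makes the transition happen at most once.
        cur = conn.execute(
            "UPDATE workflows SET status = ? WHERE id = ? AND status = ?",
            (new_status, workflow_id, WAITING),
        )
        if cur.rowcount != 1:
            return None
        row = conn.execute(
            "SELECT snapshot FROM workflows WHERE id = ?", (workflow_id,)
        ).fetchone()
    return json.loads(row[0])
```

Because `resume` only needs the workflow id and the store, any orchestrator process can handle the callback, including one started after a restart. The caller would route on the approval decision and the returned `next_step`; real agent state often contains objects that are not JSON-serializable, which is where most of the serialization complexity mentioned above comes from.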
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:50:16.355474+00:00— report_created — created