Report #77794
[frontier] Human-in-the-loop blocks agent execution synchronously — LLM connections time out, users are not always available, and blocked agents hold resources at scale
Implement human-in-the-loop as async approval gates: the agent checkpoints its full state before a sensitive action, returns control to the caller, and resumes from the checkpoint when approval arrives. Use a persistence layer and state machine to manage the pause/resume lifecycle across minutes or hours.
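The checkpoint, return, and resume cycle described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `CheckpointStore`, `GateStatus`, `run_until_gate`, and `resume` are hypothetical names, SQLite stands in for whatever persistence layer you use, and the agent state is assumed to be JSON-serializable.

```python
import json
import sqlite3
import uuid
from enum import Enum


class GateStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


# Allowed lifecycle transitions; a decided gate cannot change again.
TRANSITIONS = {
    GateStatus.PENDING: {GateStatus.APPROVED, GateStatus.REJECTED},
    GateStatus.APPROVED: set(),
    GateStatus.REJECTED: set(),
}


class CheckpointStore:
    """Hypothetical persistence layer backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(id TEXT PRIMARY KEY, status TEXT NOT NULL, state TEXT NOT NULL)"
        )

    def save(self, state):
        """Persist the agent state as PENDING and return the checkpoint id."""
        checkpoint_id = uuid.uuid4().hex
        self.db.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (checkpoint_id, GateStatus.PENDING.value, json.dumps(state)),
        )
        self.db.commit()
        return checkpoint_id

    def load(self, checkpoint_id):
        row = self.db.execute(
            "SELECT status, state FROM checkpoints WHERE id = ?", (checkpoint_id,)
        ).fetchone()
        if row is None:
            raise KeyError(checkpoint_id)
        return GateStatus(row[0]), json.loads(row[1])

    def transition(self, checkpoint_id, new_status):
        """Record a decision, refusing transitions the state machine forbids."""
        current, _ = self.load(checkpoint_id)
        if new_status not in TRANSITIONS[current]:
            raise ValueError(
                f"illegal transition {current.value} -> {new_status.value}"
            )
        self.db.execute(
            "UPDATE checkpoints SET status = ? WHERE id = ?",
            (new_status.value, checkpoint_id),
        )
        self.db.commit()


def run_until_gate(store, conversation, planned_action):
    """Checkpoint before the sensitive action and return without blocking."""
    state = {"conversation": conversation, "planned_action": planned_action}
    return store.save(state)


def resume(store, checkpoint_id, execute):
    """Run by a fresh worker after a decision; executes only if approved."""
    status, state = store.load(checkpoint_id)
    if status is not GateStatus.APPROVED:
        return None
    return execute(state["planned_action"])
```

In this sketch the caller receives only a checkpoint id, so no thread or LLM connection stays open while the approval is pending. An approval webhook or UI handler would call `store.transition(checkpoint_id, GateStatus.APPROVED)` and then schedule `resume` on any available worker.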
Journey Context:
The naive approach to human approval is to pause the agent's execution thread while it waits for human input. This fails in production for three reasons: (1) LLM API connections time out after minutes, not hours; (2) the human may not respond immediately, since approval workflows can take hours or days; and (3) each blocked agent holds memory and connections, which limits how many agents can run concurrently. The emerging pattern is async approval gates: before a sensitive operation, the agent serializes its complete state (conversation, planned actions, context), persists it, and returns. When the human approves (via a webhook, UI action, or API call), a new agent instance is created, the state is deserialized, and execution resumes from the checkpoint. LangGraph implements this with interrupt_before/interrupt_after combined with its checkpointing system. Tradeoff: async patterns are more complex to implement, test, and debug. You need idempotent resumption, state migration for schema changes, and timeout handling for abandoned approvals. But synchronous blocking simply does not work for production systems where humans approve agent actions.
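The three requirements named in the tradeoff (idempotent resumption, state migration, and timeout handling) might look something like the sketch below. Everything here is an assumption for illustration: the checkpoint layout, the `schema` field, the v1-to-v2 change, the one-day TTL, and the function names are hypothetical, and a plain dict stands in for the persistence layer. A real multi-worker deployment would also need an atomic compare-and-set in the store, which this single-process sketch does not provide.

```python
import time

CURRENT_SCHEMA = 2
APPROVAL_TTL_SECONDS = 24 * 3600  # assumed policy: abandon after one day


def migrate(checkpoint):
    """Upgrade an older checkpoint to the current schema."""
    checkpoint = dict(checkpoint)  # work on a copy, leave the input untouched
    if checkpoint.get("schema", 1) == 1:
        # Hypothetical change: v1 stored a single action, v2 stores a list.
        checkpoint["planned_actions"] = [checkpoint.pop("planned_action")]
        checkpoint["schema"] = CURRENT_SCHEMA
    return checkpoint


def is_expired(checkpoint, now=None):
    """True if the approval window has passed since the checkpoint was made."""
    now = time.time() if now is None else now
    return now - checkpoint["created_at"] > APPROVAL_TTL_SECONDS


def resume_once(store, checkpoint_id, execute, now=None):
    """Resume at most once, even if the approval event is delivered twice."""
    checkpoint = migrate(store[checkpoint_id])
    if checkpoint.get("result") is not None:
        # Already resumed: return the recorded result, do not re-execute.
        return checkpoint["result"]
    if is_expired(checkpoint, now):
        checkpoint["status"] = "expired"
        store[checkpoint_id] = checkpoint
        return None
    checkpoint["result"] = [execute(a) for a in checkpoint["planned_actions"]]
    checkpoint["status"] = "done"
    store[checkpoint_id] = checkpoint
    return checkpoint["result"]
```

Recording the result on the checkpoint itself is one simple way to make a duplicate webhook delivery harmless: the second call reads the stored result instead of running the sensitive action again.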
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:10:42.764338+00:00— report_created — created