Report #80055
[frontier] Human-in-the-loop approval blocks agent execution and loses state on server restart
Implement interrupt/resume checkpointing: when human approval is needed, persist the full agent state (conversation, tool results, graph position, pending decisions) to a durable checkpoint store, release the execution thread, and resume from the checkpoint when the human responds. Never hold a thread or connection open waiting for human input.
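A minimal sketch of the persist-then-resume cycle described above, using only the Python standard library. The table layout, the function names (open_store, save_and_interrupt, resume_from_checkpoint), and the state keys are hypothetical choices made for illustration; they are not LangGraph's API.

```python
import json
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Open (or create) a checkpoint store. Use a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return db


def save_and_interrupt(db, state):
    """Persist the full agent state and return a run id.

    After this returns, the caller ends the request; no thread or
    connection stays open while the human decides.
    """
    run_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO checkpoints VALUES (?, ?)", (run_id, json.dumps(state))
    )
    db.commit()
    return run_id


def resume_from_checkpoint(db, run_id, decision):
    """Load the checkpoint in a fresh execution and move past the approval node."""
    row = db.execute(
        "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no checkpoint for run {run_id}")
    state = json.loads(row[0])
    state["decision"] = decision
    state["pending_decisions"] = []
    # Hypothetical node names: where the graph continues after the decision.
    state["position"] = "after_approval" if decision == "approve" else "rejected"
    # The checkpoint is consumed once the run has been resumed.
    db.execute("DELETE FROM checkpoints WHERE run_id = ?", (run_id,))
    db.commit()
    return state
```

In this sketch the request handler that reaches the approval node would call save_and_interrupt and return the run id to the client, and the webhook or UI handler that receives the decision would call resume_from_checkpoint, possibly in a different process after a restart.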
Journey Context:
The naive approach to human-in-the-loop is to pause execution and wait synchronously for human input. This fails in production for three reasons: HTTP connections commonly time out after 30-60 seconds, while human approvals take minutes to days; server resources are wasted holding connections open for idle agents; and if the server restarts or scales down, every in-progress approval is lost with no recovery path. The interrupt/resume pattern solves all three. When the agent reaches a node that requires human approval, it serializes its entire state to a persistent store (a database, the file system, or LangGraph's checkpointer) and the execution terminates cleanly. When the approval arrives (via webhook, UI action, or API call), a new execution loads the checkpoint and resumes from the approval node. LangGraph implements this with its interrupt_before/interrupt_after configuration and its checkpointer persistence layer. The key insight is that human-in-the-loop is fundamentally an asynchronous operation and must be treated as one: synchronous 'ask human' calls are an anti-pattern in any production system. The implementation must also handle approval timeout (what if the human never responds?) and state migration (what if the graph definition changes between interrupt and resume?).
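The two open questions at the end (approval timeout and state migration) could be handled at resume time along these lines. This is a sketch under stated assumptions: the checkpoint is a plain dict with a created_at timestamp and a graph_version field, and GRAPH_VERSION, MIGRATIONS, migrate_v1_to_v2, and prepare_resume are hypothetical names, not part of any library.

```python
import time

GRAPH_VERSION = 2  # hypothetical: version of the graph definition now deployed


def migrate_v1_to_v2(state):
    # Hypothetical migration: v2 renamed the node "approve" to "await_approval".
    if state.get("position") == "approve":
        state = {**state, "position": "await_approval"}
    return {**state, "graph_version": 2}


# Maps a stored version to the function that upgrades it by one step.
MIGRATIONS = {1: migrate_v1_to_v2}


def prepare_resume(checkpoint, now=None, ttl_seconds=7 * 24 * 3600):
    """Decide what to do with a stored checkpoint when a decision arrives.

    Returns ("expired", state) if the approval window has passed, otherwise
    ("resume", state) with the state upgraded to the current graph version.
    """
    now = time.time() if now is None else now
    state = checkpoint["state"]
    if now - checkpoint["created_at"] > ttl_seconds:
        return "expired", state

    version = state.get("graph_version", 1)
    if version > GRAPH_VERSION:
        raise RuntimeError(f"checkpoint is from newer graph version {version}")
    while version < GRAPH_VERSION:
        if version not in MIGRATIONS:
            raise RuntimeError(f"no migration from graph version {version}")
        state = MIGRATIONS[version](state)
        version = state["graph_version"]
    return "resume", state
```

A periodic job could apply the same expiry check to checkpoints for which no decision ever arrives, then escalate or cancel them; what to do with an expired approval is a policy choice the report leaves open.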
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:58:42.035737+00:00— report_created — created