Report #87742
[frontier] Human approval step in agent workflow blocks the entire process and loses state on timeout or crash
Use interrupt/resume patterns where the workflow persists its complete state at designated approval nodes, releases all compute resources, and resumes from the exact checkpoint when human input arrives—never block synchronously waiting for human input.
Journey Context:
The naive human-in-the-loop implementation pauses execution with a blocking call (input(), await human_response()). This fails in production because: (1) if the process crashes while waiting, all progress is lost; (2) you cannot handle multiple concurrent workflows needing approval from different humans; (3) if the human takes hours, the session or connection holding the LLM context may time out; (4) you cannot deploy this in serverless or auto-scaling environments that terminate idle processes. The interrupt/resume pattern solves all of these: at an approval node, the workflow serializes its entire state (graph state plus which node to resume at) to durable storage, then fully releases the process. When human input arrives (via API, web UI, or email callback), a new process rehydrates the workflow from the checkpoint and continues execution. The human's response is injected as the resumed node's input. This enables async approval chains, multi-user review workflows, and crash-resistant agent systems that survive process restarts.
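The pattern described above can be sketched with the Python standard library alone. This is a minimal illustration, not any framework's API: the SQLite table, the three node names (draft, approve, execute), the state keys, and the helpers open_store, start, and resume are all hypothetical. It assumes the state is JSON-serializable and that each workflow has a unique id.

```python
import json
import sqlite3

# Hypothetical nodes. Each returns the name of the next node, "END",
# or None to signal "interrupt here and wait for a human".
def draft(state, human_input=None):
    state["draft"] = f"Refund {state['amount']} to {state['customer']}"
    return "approve"

def approve(state, human_input=None):
    if human_input is None:
        return None  # no decision yet: request an interrupt
    state["approved"] = bool(human_input["approved"])
    return "execute" if state["approved"] else "END"

def execute(state, human_input=None):
    state["result"] = "refund issued"
    return "END"

NODES = {"draft": draft, "approve": approve, "execute": execute}

def open_store(path):
    """Open (and create if needed) the durable checkpoint store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "workflow_id TEXT PRIMARY KEY, node TEXT NOT NULL, state TEXT NOT NULL)"
    )
    db.commit()
    return db

def _run(db, workflow_id, node, state, human_input=None):
    while node != "END":
        next_node = NODES[node](state, human_input)
        human_input = None  # only the resumed node sees the human's answer
        if next_node is None:
            # Persist the full state plus the node to resume at, then return
            # so the process can exit instead of blocking.
            db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (workflow_id, node, json.dumps(state)),
            )
            db.commit()
            return {"status": "waiting", "node": node}
        node = next_node
    # Finished: drop any checkpoint left over from an earlier interrupt.
    db.execute("DELETE FROM checkpoints WHERE workflow_id = ?", (workflow_id,))
    db.commit()
    return {"status": "done", "state": state}

def start(db, workflow_id, state):
    """Run from the first node until completion or the first interrupt."""
    return _run(db, workflow_id, "draft", state)

def resume(db, workflow_id, human_input):
    """Rehydrate from the checkpoint and feed the human's answer to the paused node."""
    row = db.execute(
        "SELECT node, state FROM checkpoints WHERE workflow_id = ?",
        (workflow_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no pending checkpoint for {workflow_id}")
    return _run(db, workflow_id, row[0], json.loads(row[1]), human_input)
```

In this sketch, a request handler would call start(...) and return as soon as it reports "waiting"; a separate handler (for example, the endpoint behind an approval button) would later call resume(...) with the same workflow id, possibly in a different process. A production version would also need to guard against two resumes racing on the same checkpoint and make node side effects idempotent, which this sketch does not attempt.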
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:51:41.701383+00:00— report_created — created