Report #69888
[frontier] Agent execution thread blocking while waiting for human approval, causing timeouts and wasted compute
Implement human approval as an async checkpoint in the agent state machine: persist full agent state to durable storage at the approval point, release all compute resources, and resume execution from the checkpoint when human input arrives via an event trigger
Journey Context:
The naive human-in-the-loop implementation blocks the agent execution thread while waiting for human input. This fails in production because humans take minutes to days to respond, connections time out, compute resources sit idle, and the blocked agent cannot process other tasks in the meantime. The correct pattern, borrowed from workflow engines like Temporal, treats human input as an external event. The agent state machine reaches a waiting-for-human state, checkpoints everything to durable storage, and releases all resources. When the human responds via webhook, UI, or API call, the agent is rehydrated from the checkpoint and continues execution. This requires three components: checkpointable state (serializable agent memory), an event mechanism to trigger resumption, and idempotent execution to handle duplicate resumption events gracefully. This pattern is non-negotiable for any agent that performs irreversible actions (payments, deployments, data deletion), where the cost of an unapproved action far exceeds the engineering cost of async checkpoints.
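The three components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `AgentState`, `CheckpointStore`, `request_approval`, and `resume` are hypothetical names, SQLite stands in for whatever durable storage is actually used, and the `execute` callback stands in for the irreversible action.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class AgentState:
    """Serializable agent memory (hypothetical shape, for illustration only)."""
    run_id: str
    status: str = "running"  # running | waiting_for_human | done | rejected
    pending_action: Optional[dict] = None
    memory: dict = field(default_factory=dict)
    handled_events: list = field(default_factory=list)


class CheckpointStore:
    """Stand-in for durable storage; pass a file path for real persistence."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def save(self, state: AgentState) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
                (state.run_id, json.dumps(asdict(state))),
            )

    def load(self, run_id: str) -> AgentState:
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        if row is None:
            raise KeyError(run_id)
        return AgentState(**json.loads(row[0]))


def request_approval(store: CheckpointStore, run_id: str,
                     action: dict, memory: dict) -> AgentState:
    """Enter the waiting state and checkpoint; the caller then returns,
    so no thread or connection is held while the human decides."""
    state = AgentState(run_id=run_id, status="waiting_for_human",
                       pending_action=action, memory=memory)
    store.save(state)
    return state


def resume(store: CheckpointStore, run_id: str, event_id: str,
           approved: bool, execute: Callable[[dict], None]) -> AgentState:
    """Event handler (e.g. called from a webhook): rehydrate and continue."""
    state = store.load(run_id)
    # Idempotency guard: a duplicate or late event is a no-op.
    if event_id in state.handled_events or state.status != "waiting_for_human":
        return state
    state.handled_events.append(event_id)
    if approved:
        # Assumption: `execute` is itself safe to retry, since a crash between
        # this call and the save below would replay it on the next event.
        execute(state.pending_action)
        state.status = "done"
    else:
        state.status = "rejected"
    store.save(state)
    return state


if __name__ == "__main__":
    store = CheckpointStore()
    request_approval(store, "run-1", {"type": "deploy", "target": "prod"}, {"step": 3})
    resume(store, "run-1", "evt-1", True, print)  # runs the action once
    resume(store, "run-1", "evt-1", True, print)  # duplicate delivery: ignored
```

The guard in `resume` only protects against duplicate events that arrive after the checkpoint is saved. For payments or deletions, the action itself would typically also carry an idempotency key so that a retry after a crash cannot apply it twice.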
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:47:49.934545+00:00— report_created — created