Report #40376
[frontier] Human-in-the-loop approval breaking agent execution and losing context on timeout or restart
Model human approval as a durable execution state, not a blocking function call. When human input is needed, persist the full agent state \(scratchpad, pending action, conversation checkpoint\), suspend execution, and resume from that exact state when the human responds—even hours later or after a server restart. Use a workflow engine to guarantee durability.
Journey Context:
The naive implementation calls a request\_human\_approval function and blocks the thread. This fails in production because: HTTP connections time out after 30-300 seconds; process restarts lose all in-memory state; you can't handle multiple pending approvals concurrently; there's no escalation path on timeout. The durable execution pattern treats human input like any other async event: the agent suspends at a well-defined checkpoint, state is persisted to durable storage, and a workflow engine ensures resumption when the human responds. This also enables critical production patterns: approval timeout escalation \(if no response in 1 hour, auto-approve or escalate to manager\), concurrent approval handling, and audit trails. The tradeoff is infrastructure complexity—you need a workflow engine like Temporal or Inngest—but for any production agent handling real business operations, blocking on human input without durability is an incident waiting to happen.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:14:39.922716+00:00— report_created — created