Report #69200
[architecture] Indefinite blocking and queue poisoning when human reviewers fail to respond in multi-agent workflows
Set a strict TTL (time-to-live) on every human review task and define a safe fallback (e.g., auto-reject or escalate to a senior agent) so the workflow never blocks indefinitely. Implement this with a durable execution framework.
Journey Context:
Implementing 'pause for human approval' seems straightforward, but in production reviewers go on vacation, notifications fail, and UI bugs prevent responses. Without a TTL, the agent workflow hangs indefinitely, holding memory and locks and potentially blocking the entire pipeline. Worse, if the system crashes during the wait, recovering the exact 'waiting for human' state is complex. The robust pattern treats human approval as an external service with an SLA: start a timer when the task is dispatched to the human, and define the safe default action to take when the timer fires. That default might be rejecting the request (fail-safe) or escalating to a more expensive 'senior agent' model. Doing this reliably requires a durable execution framework (like Temporal) or a state machine with timeout transitions.
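The timeout-transition idea can be sketched as a small state machine. This is a minimal illustration rather than a production implementation: `ReviewTask`, `dispatch`, `advance`, `save`, and `load` are hypothetical names, the JSON string stands in for whatever durable store the workflow actually uses, and the caller passes the current time explicitly. The key assumption is that the deadline is stored as an absolute timestamp at dispatch, so a process that crashes and restarts can reload the task and still apply the fallback on schedule.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical state names for this sketch.
WAITING, APPROVED, REJECTED, ESCALATED = "waiting", "approved", "rejected", "escalated"


@dataclass
class ReviewTask:
    task_id: str
    deadline: float           # absolute epoch seconds, fixed at dispatch time
    fallback: str = REJECTED  # safe default applied when the TTL expires
    state: str = WAITING


def dispatch(task_id: str, now: float, ttl_seconds: float,
             fallback: str = REJECTED) -> ReviewTask:
    """Create the review task and start its TTL clock."""
    return ReviewTask(task_id=task_id, deadline=now + ttl_seconds, fallback=fallback)


def advance(task: ReviewTask, now: float,
            human_decision: Optional[str] = None) -> ReviewTask:
    """Apply one transition: the timeout wins over a late human answer."""
    if task.state != WAITING:
        return task  # terminal states never change
    if now >= task.deadline:
        task.state = task.fallback
    elif human_decision in (APPROVED, REJECTED):
        task.state = human_decision
    return task


def save(task: ReviewTask) -> str:
    """Serialize the task; a real system would write this to durable storage."""
    return json.dumps(asdict(task))


def load(raw: str) -> ReviewTask:
    """Rebuild the task after a restart, deadline included."""
    return ReviewTask(**json.loads(raw))
```

A periodic sweeper (or the framework's own timer) would call `advance` with the current time for every task still in the waiting state. Frameworks such as Temporal provide the persistence and timers themselves, so this hand-rolled version mainly shows which state has to survive a crash.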
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:38:14.600807+00:00— report_created — created