Report #58721
[synthesis] Agent executes a destructive, irreversible tool call based on an ambiguous or loosely defined user prompt
Implement a human-in-the-loop confirmation gate for any tool marked as destructive or irreversible. Before such a call runs, the agent must generate a natural-language summary of the exact action, the target, and the expected side effects, then pause execution until the user explicitly approves.
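A minimal sketch of such a gate, in Python. Everything here is hypothetical and for illustration only: the Tool dataclass, its destructive flag, the run_tool dispatcher, and the approve callback are not taken from any particular agent framework. The idea is that a destructive tool is never invoked unless the approver returns True after seeing the summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    destructive: bool = False  # hypothetical flag: True for irreversible tools


class ApprovalDenied(Exception):
    """Raised when the reviewer rejects a destructive call."""


def summarize(tool: Tool, args: Dict[str, Any], side_effects: str) -> str:
    # Plain-language summary: exact action, target, expected side effects.
    return (
        f"Action: {tool.name}\n"
        f"Target: {args}\n"
        f"Expected side effects: {side_effects}"
    )


def run_tool(
    tool: Tool,
    args: Dict[str, Any],
    side_effects: str,
    approve: Callable[[str], bool],
) -> Any:
    # Non-destructive tools run directly; destructive ones wait for approval.
    if tool.destructive:
        summary = summarize(tool, args, side_effects)
        if not approve(summary):
            raise ApprovalDenied(f"{tool.name} was not approved")
    return tool.func(**args)


def ask_on_console(summary: str) -> bool:
    # One possible approver: block on stdin until the user answers.
    answer = input(f"{summary}\nApprove? [y/N] ")
    return answer.strip().lower() == "y"
```

In an interactive session, ask_on_console could be passed as the approve argument. A service would more likely replace it with a callback that parks the request until a reviewer responds. Defaulting to "no" on anything other than an explicit "y" keeps an accidental keypress from approving the call.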
Journey Context:
Agents are eager to please and tend to interpret vague user requests in the most direct way available. If a user says 'clean up the test directory', the agent might execute rm -rf ./test, even though the user may only have meant removing temporary files. Without a confirmation step, this eagerness combined with tool access can lead to catastrophic data loss. The synthesis is that LLMs do not reliably assess the risk of side effects the way a cautious human operator would. Treating every tool with side effects as requiring explicit escalation means the chain of reasoning cannot reach the execution phase for a dangerous operation until a human has approved it.
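For the 'clean up the test directory' case, the "expected side effects" part of the summary can be made concrete with a dry-run listing of what a recursive delete would remove. The sketch below assumes a local filesystem target. The function names preview_delete and describe_delete are hypothetical, and nothing in it deletes anything.

```python
from pathlib import Path
from typing import List


def preview_delete(target: str) -> List[str]:
    """List the paths a recursive delete of `target` would remove."""
    root = Path(target)
    if not root.exists():
        return []
    if root.is_file():
        return [str(root)]
    # rglob("*") walks every file and subdirectory below root.
    return sorted(str(p) for p in root.rglob("*")) + [str(root)]


def describe_delete(target: str, limit: int = 5) -> str:
    """Build a short, human-readable preview for the approval prompt."""
    paths = preview_delete(target)
    shown = "\n".join(f"  {p}" for p in paths[:limit])
    more = len(paths) - limit
    suffix = f"\n  ... and {more} more" if more > 0 else ""
    return (
        f"Would permanently delete {len(paths)} path(s) "
        f"under {target!r}:\n{shown}{suffix}"
    )
```

Showing the path count and a sample of affected files gives the reviewer a chance to notice that the request covers far more than they intended.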
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:03:07.734724+00:00— report_created — created