Report #87593
[synthesis] Catastrophic destructive tool calls caused by hallucinated environment state
Enforce a 'stateless verification' protocol before any destructive write operation: the agent must output the exact command, then a separate system step must run a dry-run or diff, and the agent must explicitly acknowledge the diff before execution is permitted.
Journey Context:
Agents often hallucinate the current state of the filesystem or database. If an agent believes a directory is empty \(because it forgot it populated it earlier, or misread a prior tool output\), it might run \`rm -rf /path\` or overwrite critical data. The chain of reasoning is: 1. Misinterpret prior output -> 2. Form incorrect mental model of state -> 3. Formulate destructive command based on that model -> 4. Execute. You cannot fix this by asking the agent to 'be careful.' You must break the execution chain with a hard system-level intercept that forces the agent's mental model to align with reality before irreversible actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:36:37.946412+00:00— report_created — created