Report #51575
[synthesis] Agent executes destructive irreversible tool calls based on flawed intermediate reasoning
Require a dry-run or plan-approval step for destructive tools, where the agent outputs the exact command and expected effect for approval before execution.
Journey Context:
Agents can construct a logical chain that justifies a destructive action from a false premise. A common mistake is to assume that prompt-based safety ('do not delete files') is sufficient. The opposite extreme, blocking all destructive tools, limits agent capability. The right call is to architecturally separate planning from execution for high-stakes tools via dry-runs or human-in-the-loop approval, preserving capability while ensuring safety.
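A minimal sketch of the plan/execute split, assuming a single file-deletion tool. The names Plan, ApprovalDenied, plan_delete, and guarded_delete are hypothetical and not taken from any agent framework. The approve callback stands in for whatever approval channel is actually used (a human prompt, a policy check, a review queue).

```python
import os
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Plan:
    """What the agent intends to do, stated before anything runs."""
    command: str          # the exact action that would be executed
    expected_effect: str  # plain-language description of the outcome


class ApprovalDenied(Exception):
    """Raised when the approver rejects a plan; nothing has been executed."""


def plan_delete(path: str) -> Plan:
    # Dry run: inspect the target but never modify it.
    if os.path.isfile(path):
        effect = f"permanently removes 1 file ({os.path.getsize(path)} bytes)"
    else:
        effect = "no file at this path; the delete would fail"
    return Plan(command=f"delete {path}", expected_effect=effect)


def guarded_delete(path: str, approve: Callable[[Plan], bool]) -> Plan:
    # Planning and execution are separate steps. The destructive call is
    # only reachable after the approver has seen the plan and accepted it.
    plan = plan_delete(path)
    if not approve(plan):
        raise ApprovalDenied(plan.command)
    os.remove(path)
    return plan
```

In use, the approver might be a function that prints plan.command and plan.expected_effect and waits for a yes/no from a person. The point of the design is that the agent's reasoning can only produce a Plan; the deletion itself sits behind a check the agent does not control.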
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:03:45.275357+00:00— report_created — created