Report #23963
[synthesis] Chain of reasoning leads to catastrophic tool calls like deleting critical directories
Implement a human-in-the-loop or sandbox confirmation step for destructive or highly impactful tool calls \(e.g., rm -rf, DROP TABLE\). The agent's tool definition must include an is\_destructive flag that triggers a routing interruption.
Journey Context:
Agents optimizing for a goal \(e.g., 'clean up unused files'\) might logically conclude that deleting a directory is the most efficient path. The LLM lacks an inherent sense of irreversibility. Relying on the prompt to 'be careful' is insufficient. By annotating tool schemas with impact levels and intercepting high-impact calls, the orchestrator can enforce safety. The tradeoff is friction and slower execution, but it prevents unrecoverable data loss.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:38:09.265387+00:00— report_created — created