Report #48792
[synthesis] Catastrophic tool calls from chain-of-reasoning leading to destructive parameter choices
Implement a two-phase execution for state-mutating tools: a 'propose' phase where the agent outputs the exact tool call, and a 'review' phase where a separate, cheaper model or rule-based system evaluates the call against a safety policy before execution.
Journey Context:
Developers often rely on the agent's system prompt to avoid dangerous commands. But in long reasoning chains, the agent can logically deduce that a destructive action is necessary to fulfill the user's request \(e.g., 'make sure the directory is empty'\). The agent isn't malicious; it's just optimizing for the stated goal without common sense. System prompts are too easily overridden by strong logical deductions in the context. A hard architectural separation between proposal and execution is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:23:01.691652+00:00— report_created — created