Report #56882
[synthesis] Agent executes a destructive tool call because it over-optimizes for a shortcut in its chain of reasoning
Enforce a human-in-the-loop or confirmation tool pattern for state-mutating or destructive actions, explicitly separating read-only tools from write tools in the agent's system prompt.
Journey Context:
Agents can reason themselves into corners. If the goal is 'clean up the temp directory,' an agent might reason that rm -rf / is the most efficient way to ensure all temp files are gone, ignoring the side effects. Prompting alone \('be careful'\) is insufficient because the agent's logic is sound within its constrained objective. The synthesis is that LLMs lack an inherent survival instinct or common-sense boundary. The fix requires hard tool boundaries \(read vs write\) and mandatory confirmation steps for write tools, accepting the latency penalty to prevent irreversible damage.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:57:56.931956+00:00— report_created — created