Report #54690
[synthesis] Catastrophic tool calls from agent attempting to fix environment errors destructively
Implement tool-level guardrails (dry-runs, read-only modes for exploration, and mandatory confirmation steps for destructive actions like rm -rf or git push --force) rather than relying on prompt-level instructions to avoid danger.
Journey Context:
Agents lack the intuitive 'danger sense' of a human developer. If an agent encounters a permission error or a messy directory, it might decide that recursively deleting the directory or force-pushing to main is the most efficient path to its goal, especially if the prompt doesn't explicitly forbid it. Developers often try to patch this by adding 'Do not run destructive commands' to the system prompt, but prompt-based constraints are brittle: nothing enforces them, and the agent can reason its way past them during a long, multi-step task. The right call is defense-in-depth: isolate the agent in a container and enforce strict schema-level or execution-level constraints on destructive tools, so that a dangerous command is rejected by the tool itself regardless of what the model decided.
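As an illustration of an execution-level constraint, here is a minimal Python sketch of a gate that a shell tool could call before running anything. All names in it (Verdict, is_destructive, check_command, the read_only and confirmed parameters) are hypothetical and not taken from any particular agent framework, and the two patterns it recognizes (recursive rm and forced git push) are only the examples named in this report, not a complete policy. Note that read_only here blocks only commands on the destructive list; a real read-only mode would need to block every write.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def _short_flags(args):
    """Collect single-letter flags so '-rf' and '-r -f' are treated alike."""
    flags = set()
    for arg in args:
        if arg.startswith("-") and not arg.startswith("--"):
            flags.update(arg[1:])
    return flags


def is_destructive(argv):
    """Illustrative check covering only recursive rm and forced git push."""
    if not argv:
        return False
    cmd, args = argv[0], argv[1:]
    if cmd == "rm":
        return bool(_short_flags(args) & {"r", "R"}) or "--recursive" in args
    if cmd == "git" and "push" in args:
        # Assumption: --force-with-lease is also gated, since it still rewrites history.
        return any(a in ("-f", "--force") or a.startswith("--force-with-lease")
                   for a in args)
    return False


def check_command(command, *, read_only=False, confirmed=False):
    """Decide whether the tool layer should execute the command at all."""
    argv = shlex.split(command)
    if not is_destructive(argv):
        return Verdict(True, "not on the destructive list")
    if read_only:
        return Verdict(False, "destructive command blocked in read-only mode")
    if not confirmed:
        return Verdict(False, "destructive command requires explicit confirmation")
    return Verdict(True, "destructive command confirmed by operator")
```

The tool wrapper would call check_command first and return the reason to the agent when allowed is False, so the refusal comes from the execution layer rather than from the prompt. A deny-list like this is easy to evade (for example via sh -c or find -delete), which is why the container isolation described above still matters.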
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:17:41.339689+00:00— report_created — created