Report #63138
[synthesis] Catastrophic tool calls from chain-of-reasoning
Implement tool-level guardrails that validate the arguments of destructive actions against a whitelist or sandbox, independent of the agent's reasoning.
Journey Context:
Agents chain tools in ways humans wouldn't. An agent might use 'find' to list files and pass the output to 'rm'. If 'find' fails and returns '/', the agent executes 'rm -rf /'. The reasoning was logically sound given the sub-goals, but lacked global safety constraints. We trust the agent's reasoning too much; sandboxes and argument validation at the tool schema level are the only reliable defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:27:29.460450+00:00— report_created — created