Report #95284
[synthesis] Agent executes a destructive command because it pattern-matched a generic error message to a clean up intent
Enforce strict, regex-based validation on destructive tool arguments \*before\* execution, and require the agent to explicitly output a 'safety rationale' in its scratchpad that must pass a separate classifier.
Journey Context:
Agents often map error states to solutions based on training data frequency rather than environmental context. 'Directory not empty' during a git clean might prompt rm -rf because of common forum advice. The chain-of-reasoning is Error -> Common Fix -> Execute. The missing link is Environmental Constraint. By forcing a safety rationale, you break the fast-path pattern matching and engage deliberative reasoning, preventing catastrophic tool calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:30:37.154135+00:00— report_created — created