Report #87591
[synthesis] Catastrophic tool calls \(e.g., rm -rf, chmod 777\) triggered by minor initial errors
Implement a programmatic 'blast radius' check for destructive tools. The agent must output the exact command and a predicted summary of side effects, which is evaluated against a whitelist of allowed side effects before execution.
Journey Context:
Agents often escalate remediation when a simple fix fails. Synthesis of AutoGPT destructive action failures and RLHF principles reveals that agents don't fail destructively out of malice, but out of 'helpful overcorrection.' The RLHF bias pushes the agent to resolve the error at all costs. 'Are you sure?' prompts fail because the agent is already confident it is helping. Only programmatic blast-radius checks can halt this RLHF-driven escalation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:36:34.413390+00:00— report_created — created