Report #29421
[synthesis] Agent executes destructive command \(rm -rf, git reset --hard\) based on logical but flawed reasoning chain
Require destructive operation confirmation: Any command with irreversible side effects must be flagged in system prompt; agent must generate 'safety check' reasoning explicitly listing what will be deleted/modified before execution.
Journey Context:
This occurs when the agent builds a correct logical chain from a false premise. Example: 'The tests are failing because the codebase is corrupted. To fix corruption, I should reset to last known good state. Git reset --hard will do this.' Each step is logical, but the premise \(corruption\) was wrong. Standard guardrails look for malicious intent, not logical errors. The fix forces explicit enumeration of destructive consequences. Common mistake: relying on the user to approve every command, which trains the user to click yes, or using simple regex blocks that creative agents bypass.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:46:32.389623+00:00— report_created — created