Report #42062
[agent_craft] Agent executes destructive operations without confirmation after being manipulated
Implement a confirmation gate for all irreversible operations. Classify every tool call as either read/safe or write/destructive. Never auto-execute destructive operations (file deletion, database drops, production deployments, irreversible git operations) based solely on LLM output; require explicit user confirmation, with a preview of what will happen.
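A minimal sketch of such a gate, assuming a hypothetical agent runtime in which every tool call passes through one dispatcher. The tool names, the `ToolCall` type, and the `run` and `confirm` callbacks are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names, for illustration only.
DESTRUCTIVE_TOOLS = {
    "delete_file",
    "drop_database",
    "deploy_production",
    "git_force_push",
}


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def is_destructive(call: ToolCall) -> bool:
    """Classify a call as write/destructive (True) or read/safe (False)."""
    return call.name in DESTRUCTIVE_TOOLS


def preview(call: ToolCall) -> str:
    """Describe exactly what would run, so the user can make an informed choice."""
    arg_text = ", ".join(f"{k}={v!r}" for k, v in sorted(call.args.items()))
    return f"About to run {call.name}({arg_text}). This cannot be undone."


def execute_with_gate(
    call: ToolCall,
    run: Callable[[ToolCall], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run safe calls directly; run destructive calls only after confirmation."""
    if is_destructive(call) and not confirm(preview(call)):
        return "cancelled"
    return run(call)
```

In an interactive session, `confirm` could be something like `lambda text: input(text + " Proceed? [y/N] ").strip().lower() == "y"`. The point of the design is that the LLM only proposes the call; the decision to proceed comes from the user.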
Journey Context:
OWASP LLM08 (Excessive Agency in the 2023 OWASP Top 10 for LLM Applications; renumbered LLM06 in the 2025 edition) is the silent killer of coding agents. The agent has tool access (file write, shell execution, API calls), and the LLM decides when to use it. If the LLM is manipulated via prompt injection or social engineering, it can cause real damage: deleting source code, pushing to production, exposing secrets. The fix is not better prompts; it is architectural guardrails. Every tool should be classified by risk level, and high-risk tools should require human-in-the-loop confirmation. This mirrors the principle of least privilege and is consistent with the NIST AI RMF GOVERN function, whose GOVERN 2 category calls for accountability structures for managing AI risk.
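One way to make the risk classification architectural rather than prompt-based is a registry that lives in code and fails closed. The sketch below is an assumption about how that might look; the tool names, the `Risk` levels, and the `confirmed` argument are hypothetical.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"    # read-only, safe to auto-execute
    HIGH = "high"  # irreversible or externally visible


# Hypothetical registry; tool names are illustrative.
TOOL_RISK = {
    "read_file": Risk.LOW,
    "list_directory": Risk.LOW,
    "write_file": Risk.HIGH,
    "run_shell": Risk.HIGH,
    "call_api": Risk.HIGH,
}


def risk_of(tool_name: str) -> Risk:
    # Fail closed: a tool nobody has classified is treated as high risk.
    return TOOL_RISK.get(tool_name, Risk.HIGH)


def needs_human(tool_name: str, llm_args: dict) -> bool:
    # The decision depends only on the registry. Arguments chosen by the
    # model (for example a hypothetical "confirmed" flag) are deliberately
    # ignored, so injected text cannot waive the gate.
    del llm_args
    return risk_of(tool_name) is Risk.HIGH
```

Because the registry is ordinary code outside the model's context, a manipulated prompt can change which tool the model asks for but not how that tool is classified.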
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:04:26.263456+00:00— report_created — created