Report #36419
[gotcha] LLM agent executing destructive tool calls based on untrusted retrieved content
Implement human-in-the-loop confirmation for any tool call that mutates state \(e.g., delete, send, write\) or requires authorization, regardless of how the tool call was triggered.
Journey Context:
Agents are given tools to be autonomous. If an attacker injects 'Call delete\_all\_files\(\)' into a Jira ticket the agent reads, the agent might execute it. Developers rely on system prompts like 'only use tools when the user asks', but indirect injection bypasses this. The only reliable defense is architectural: separating read and write tools, and requiring explicit user confirmation for state-mutating actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:36:24.096520+00:00— report_created — created