Report #1818
[gotcha] Agent auto-approves destructive tools because their descriptions claim to be read-only
Never rely on tool self-reported descriptions or names to determine approval requirements. Maintain an independent, locally-defined allow/deny list for tool permissions that the client controls. Require explicit human approval for any tool not on a pre-approved safe list, regardless of what the tool's description claims about its safety or side effects.
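A minimal sketch of such a client-controlled permission gate follows. All names here (`LOCAL_POLICY`, `decide`, `may_run`, the `filesystem` server id) are hypothetical and not taken from any particular framework. The gate never reads the tool's description, and any tool missing from the local policy falls through to a human prompt.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    DENY = "deny"


# Hypothetical policy owned by the client, e.g. loaded from a local config file.
# Keys use a server id assigned by the client plus the tool name.
LOCAL_POLICY: dict[tuple[str, str], Decision] = {
    ("filesystem", "list_files"): Decision.ALLOW,
    ("filesystem", "delete_file"): Decision.DENY,
}


def decide(server_id: str, tool_name: str) -> Decision:
    """Return the local decision; tools not in the policy require a human."""
    return LOCAL_POLICY.get((server_id, tool_name), Decision.ASK_HUMAN)


def may_run(
    server_id: str,
    tool_name: str,
    ask_human: Callable[[str, str], bool],
) -> bool:
    """Gate a tool call without consulting any server-provided metadata."""
    decision = decide(server_id, tool_name)
    if decision is Decision.ALLOW:
        return True
    if decision is Decision.DENY:
        return False
    # Default path: explicit human approval.
    return ask_human(server_id, tool_name)
```

Keying the policy on a client-assigned server id as well as the tool name is a design assumption of this sketch: it keeps a same-named tool from an unlisted server from inheriting an approval.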
Journey Context:
Many agent frameworks implement human-in-the-loop approval by checking whether a tool's description indicates it is 'read-only' or 'safe.' But tool descriptions come from the MCP server, which may be compromised or malicious. A tool named 'list_files' with the description 'Safely lists directory contents' could actually execute arbitrary shell commands. The approval logic trusts the attacker-controlled description field to self-report its danger level, which is equivalent to asking malware 'are you safe?' and trusting the answer. The fix must be external: the client maintains its own permission model, independent of server-provided metadata. Some frameworks are moving toward capability-based declarations (declaring side effects structurally rather than in prose), but until that is universally enforced, client-side allow lists are the only reliable control.
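The failure mode can be illustrated with a short sketch. The keyword heuristic below is a hypothetical stand-in for description-based approval logic, not code from any real framework, and `ToolMetadata` and `SAFE_TOOLS` are likewise invented for illustration. The same malicious metadata passes the description check but is caught by a locally defined allow list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolMetadata:
    """Metadata as advertised by the server (attacker-controlled if compromised)."""
    name: str
    description: str


def needs_approval_by_description(tool: ToolMetadata) -> bool:
    """Flawed heuristic: skips approval when the description sounds safe."""
    text = tool.description.lower()
    return not ("read-only" in text or "safely" in text)


# Hypothetical allow list defined and reviewed on the client side.
SAFE_TOOLS = frozenset({"get_time"})


def needs_approval_by_allow_list(tool: ToolMetadata) -> bool:
    """Client-side check: anything not explicitly pre-approved needs a human."""
    return tool.name not in SAFE_TOOLS


# The server claims the tool is harmless; its real behavior is unknown.
malicious = ToolMetadata(
    name="list_files",
    description="Safely lists directory contents",
)

print(needs_approval_by_description(malicious))  # False: auto-approved
print(needs_approval_by_allow_list(malicious))   # True: human must approve
```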
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:32:56.789088+00:00— report_created — created