Report #97917
[gotcha] LLM follows hidden instructions embedded in an MCP tool's name or description
Treat every tool manifest (name, description, annotations, JSON schema) as untrusted content. Pin and hash approved manifests, scan descriptions for instruction-like text before loading them into the model context, and require human approval before any side-effecting or sensitive tool call.
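A minimal sketch of the pin-and-scan step, assuming the host holds each tool manifest as a plain dict using the MCP field names `name`, `description`, `inputSchema`, and `annotations`. The helper names (`manifest_hash`, `find_suspicious_text`, `is_approved`) and the regex list are hypothetical illustrations, not part of any MCP SDK, and a pattern scan like this is a heuristic that a determined attacker can evade.

```python
import hashlib
import json
import re

# Hypothetical heuristics for instruction-like text. A real deployment would
# need a broader, tested set and should not rely on regexes alone.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"do not (tell|mention|reveal)", re.IGNORECASE),
    re.compile(r"<\s*(important|system|secret)\s*>", re.IGNORECASE),
    re.compile(r"before (using|calling) (this|any) tool", re.IGNORECASE),
]


def manifest_hash(manifest: dict) -> str:
    """Return a SHA-256 digest of the manifest in canonical JSON form."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def iter_strings(value):
    """Yield every string in a nested manifest, including dict keys."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield str(key)
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def find_suspicious_text(manifest: dict) -> list[str]:
    """Return the manifest strings that match any instruction-like pattern."""
    hits = []
    for text in iter_strings(manifest):
        if any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS):
            hits.append(text)
    return hits


def is_approved(manifest: dict, pinned: dict[str, str]) -> bool:
    """Accept a manifest only if its hash matches the pin and the scan is clean."""
    name = manifest.get("name")
    if not isinstance(name, str):
        return False
    if pinned.get(name) != manifest_hash(manifest):
        return False  # unknown tool, or the manifest changed since review
    return not find_suspicious_text(manifest)
```

Hashing the whole manifest rather than only the description means a later change to the name, schema, or annotations also invalidates the pin and forces a fresh review.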
Journey Context:
The exploit plays out on the client side. The MCP server only sends metadata, and the host client forwards it unfiltered into the LLM's context window, where the model reads it as a usage instruction. Because the tool code itself can be benign, developers often audit only the implementation and miss that the description is effectively executable context. The MCP specification says that descriptions of tool behavior, such as annotations, should be treated as untrusted unless they come from a trusted server, and empirical tests show attack success rates ranging from 0% to 100% depending on the client, so the defense must live in the host, not the server.
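One way a host-side defense could look is an approval gate in front of every tool call. The sketch below assumes the host keeps its own allowlist of read-only tools and is handed two callables: `execute`, which performs the call, and `ask_user`, which prompts a human. All names here (`READ_ONLY_TOOLS`, `gated_call`, `ApprovalDenied`, and the tool names) are hypothetical and not part of any MCP client API.

```python
from typing import Any, Callable

# Host-maintained allowlist (hypothetical tool names). It is deliberately NOT
# derived from server-supplied annotations, which are themselves untrusted.
READ_ONLY_TOOLS = frozenset({"search_docs", "get_weather"})


class ApprovalDenied(Exception):
    """Raised when the human reviewer rejects a tool call."""


def gated_call(
    tool_name: str,
    arguments: dict[str, Any],
    execute: Callable[[str, dict[str, Any]], Any],
    ask_user: Callable[[str, dict[str, Any]], bool],
) -> Any:
    """Run a tool call, asking for human approval unless the tool is allowlisted."""
    if tool_name not in READ_ONLY_TOOLS:
        # Show the reviewer the concrete arguments, not just the tool name,
        # because injected instructions often surface as unexpected arguments.
        if not ask_user(tool_name, arguments):
            raise ApprovalDenied(f"user rejected call to {tool_name!r}")
    return execute(tool_name, arguments)
```

Defaulting to "ask" for any tool missing from the allowlist keeps a newly added or renamed tool from running unreviewed.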
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:55:14.281898+00:00— report_created — created