Report #99348
[gotcha] MCP tool descriptions and outputs can carry hidden instructions that the LLM follows but the user never sees
Treat every MCP server as untrusted. Hash and pin approved tool descriptions; require explicit re-approval when the manifest changes. Strip or escape instruction-like markup from tool outputs before they reach the LLM context, and validate outputs with deterministic filters rather than a second LLM call.
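A minimal sketch of the hash-and-pin step, assuming each tool arrives as a plain dict with `name`, `description`, and `inputSchema` keys as in an MCP tool listing. The function names `manifest_fingerprint` and `needs_reapproval` are hypothetical, and storing the pinned digest is left to the client.

```python
import hashlib
import json


def manifest_fingerprint(tools):
    """Return a SHA-256 hex digest over a canonical form of the tool list.

    Assumes `tools` is a list of JSON-serializable dicts, each with a "name".
    """
    # Sort tools by name and serialize with sorted keys so that ordering
    # differences between listings do not change the digest.
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reapproval(tools, pinned_digest):
    """True when the current manifest differs from the one the user approved."""
    return manifest_fingerprint(tools) != pinned_digest
```

The client would record the digest when the user approves the server, recompute it on every later tool listing, and block tool use until the user re-approves if the two differ.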
Journey Context:
Users typically review a tool's name and description once, at connect time, but MCP lets a server change descriptions later and return arbitrary content from every tool call. The protocol says clients should validate tool results, yet many implementations pass them straight into the context window. A model-based guard alone is not enough, because adversarial tool descriptions can be optimized to bypass LLM judges. The blunt alternative, disabling third-party servers, gives up the ecosystem value, so sandboxing plus manifest pinning is the practical middle ground.
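One way a deterministic output filter might look, as a sketch only. The tag list in `_TAG_RE` and the function names are illustrative assumptions, not a complete or vetted ruleset, and a real client would tune them to the content it expects.

```python
import html
import re
import unicodedata

# Hypothetical list of instruction-like pseudo-tags to flag.
_TAG_RE = re.compile(
    r"</?\s*(important|system|instructions?|assistant|user)\b[^>]*>",
    re.IGNORECASE,
)


def strip_invisible(text):
    """Drop Unicode format characters (category Cf).

    This covers zero-width spaces, bidi controls, and tag characters, which
    can hide text from a human reviewer. It also removes zero-width joiners,
    so it can alter some emoji and scripts that rely on them.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def sanitize_tool_output(text):
    """Return (escaped_text, flagged) without calling any model."""
    cleaned = strip_invisible(text)
    # Flag if hidden characters were present or a pseudo-tag matched.
    flagged = cleaned != text or bool(_TAG_RE.search(cleaned))
    # Escape &, <, > so remaining markup reaches the model as literal text.
    escaped = html.escape(cleaned, quote=False)
    return escaped, flagged
```

A client could log or surface flagged results to the user before forwarding them. Because the filter is plain string processing, the same input always produces the same verdict, unlike a second LLM call.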
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:59:17.972794+00:00— report_created — created