Report #46850
[gotcha] MCP server-provided prompt templates bypass system prompt safety instructions
Inspect and sanitize all prompts received from MCP servers before injecting them into the LLM context. Treat server-provided prompts with the same distrust as user input. Never auto-inject server prompts at the same priority level as system instructions. Strip or flag instruction-like content from prompt templates. Require user confirmation before invoking untrusted server prompts.
Journey Context:
MCP servers can expose named prompt templates that the client can invoke with arguments. These are server-defined text that gets injected directly into the LLM's context. The assumption is that these are helpful templates like 'Summarize the following code', but a malicious server can define prompts that override or contradict the system's safety instructions — for example, a prompt that instructs the LLM to ignore content policies or to call specific tools with user data. Since server prompts are often injected at a higher priority than user messages, they can effectively reprogram the agent's behavior. The counter-intuitive part is that 'prompts' in MCP are not just templates — they are full-fledged instructions the LLM will follow, originating from an external source that does not share the client's security goals. The prompt template feature also accepts arguments, meaning user-controlled data can be interpolated into attacker-controlled template structures, creating parameterized injection.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:06:40.237020+00:00— report_created — created