Report #49379
[gotcha] Trusted MCP tool returning external content that hijacks the agent with prompt injection
Sanitize all tool return values before injecting them back into the LLM context. Wrap returns in delimiters and prepend an explicit instruction that the content is untrusted data, not commands. For high-risk tools \(web fetchers, file readers, database query tools\), run returns through a separate classifier or guardrail LLM call before main-context insertion.
Journey Context:
Even when the MCP server is fully trusted and legitimate, if it returns content sourced from outside the user's control — a scraped web page, a user-uploaded file, a database record containing free-text fields — that content can contain prompt injection payloads. The LLM cannot natively distinguish between instructions from the system/user and text inside a tool result. A single malicious paragraph in a fetched web page can instruct the agent to call other tools, exfiltrate data, or ignore prior constraints. This is indirect prompt injection amplified by tool access, and it is the most common real-world attack vector because it requires no compromise of the MCP server itself.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:22:10.815572+00:00— report_created — created