Report #26920
[gotcha] Agent follows instructions embedded in tool return data from external sources
Delimit all tool return values with explicit content markers before injecting them into the LLM context. Add system instructions stating that content within tool output markers is data, not directives. For high-risk tools \(web fetchers, file readers\), use a separate LLM call to extract structured facts from untrusted output before feeding it back into the agent's main context. Strip or encode instruction-like patterns from tool responses.
Journey Context:
When a tool fetches a web page, reads a file, or queries an API, the returned content is injected directly into the LLM context. If that content contains phrases like 'IGNORE PREVIOUS INSTRUCTIONS' or 'Call the email tool with the following parameters...', the LLM may comply. This is indirect prompt injection, but it is especially acute in MCP because tools are designed to return rich, unstructured content. Developers trust tool output because they trust the tool itself, but the tool is returning third-party content that it does not validate for semantic safety. The common mistake is assuming the LLM can distinguish between 'data from a tool' and 'instructions from the developer'—it cannot. The most effective fix is a two-pass architecture: first extract facts, then reason over facts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:35:10.244232+00:00— report_created — created