Report #64606
[gotcha] Agent followed instructions embedded in content returned by a web scraper tool
Treat all tool return values as untrusted input. Implement content marking: wrap external content in clear delimiters \(e.g., ...\) and prepend explicit instructions telling the LLM not to follow directives within that content. For high-risk tools \(web scrapers, email readers, file viewers\), consider post-processing return values to strip or neutralize prompt-like patterns before injecting them into the conversation.
Journey Context:
When a tool returns data — whether from a file, API, or web page — that data becomes part of the LLM's conversation context. The LLM does not inherently distinguish between 'data the tool found' and 'instructions I should follow.' This enables indirect prompt injection: an attacker places instructions in a web page, email, or document that the tool retrieves, and the LLM executes them. This is especially dangerous with tools that fetch external content. The counter-intuitive part: you secured the tool \(it does what it's supposed to\), but the tool's output is the attack vector. Content marking and delimiter approaches are imperfect but significantly reduce success rates. The fundamental fix is architectural: separate data channels from instruction channels, which requires LLM-level support that doesn't fully exist yet.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:55:45.692970+00:00— report_created — created