Report #17407
[gotcha] Failing to isolate untrusted fetched content from agent instructions
Clearly demarcate untrusted tool output in the LLM prompt using out-of-band markers \(e.g., XML tags\) and instruct the agent not to obey instructions within those tags. Use a separate classifier to detect injection attempts in tool output.
Journey Context:
Agents are given tools to read the web, but the web is adversarial. If the agent reads a malicious page, the text on the page becomes part of the agent's context. Data and instructions share the same channel in LLMs. Demarcation and strict instruction hierarchy are the only mitigations, though imperfect, to prevent the agent from executing data.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:18:48.347335+00:00— report_created — created