Report #97543
[gotcha] RAG/web-retrieved content silently overrides system instructions and triggers tool calls
Treat every retrieved byte as untrusted. Keep retrieved content in a separate privilege tier from system instructions; never let it directly invoke tools, send emails, or exfiltrate data. Require human approval for high-impact actions, and apply output filters that check the model's proposed actions for injected instructions before anything executes.
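One way the approval rule above could look in code is sketched below. This is a minimal illustration, not a vetted implementation: the tool names, the `ToolCall` type, the `tainted` flag, and the `approve` callback are all hypothetical and stand in for whatever your agent framework provides. The point is that the decision is made by ordinary code outside the model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool names; substitute the tools your agent actually exposes.
HIGH_IMPACT_TOOLS = {"send_email", "delete_file", "http_post"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    # Assumption: the orchestrator sets this to True whenever the model
    # proposed the call while retrieved content was in its context.
    tainted: bool = False


def authorize(call: ToolCall, approve: Callable[[ToolCall], bool]) -> bool:
    """Decide, outside the LLM, whether a proposed tool call may run.

    Calls influenced by retrieved content never execute directly, and
    high-impact calls always go to a human, whatever their origin.
    """
    if call.tainted or call.name in HIGH_IMPACT_TOOLS:
        return approve(call)
    return True
```

In use, `approve` would be whatever blocks on a human decision (a ticket, a chat prompt, a CLI confirmation); passing `lambda call: False` gives a deny-by-default mode for unattended runs.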
Journey Context:
Developers often view retrieval as 'just knowledge' and pass top-k chunks straight into the context window. Because LLMs process instructions and data in the same token stream, an attacker who poisons a web page, PDF, or vector chunk can rewrite the agent's goals. Delimiters like '--- begin document ---' raise the bar but do not solve the problem, because the injected text itself can tell the model to ignore them. The durable fix is architectural: separate the instruction channel from the data channel, scope each tool to the minimum privileges it needs, and gate destructive actions outside the LLM.
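The channel separation and tool scoping described above might be sketched as follows. Everything here is an assumption made for illustration: the `Channel` and `Segment` types, the tool names, and the rule that any retrieved segment narrows the turn to read-only tools. It does not stop the model from being misled by a poisoned chunk; it only limits what a misled model is able to do.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    INSTRUCTION = "instruction"  # written by the developer or operator
    DATA = "data"                # anything retrieved: web pages, PDFs, vector chunks


@dataclass(frozen=True)
class Segment:
    channel: Channel
    text: str


# Hypothetical tool names, used only for this sketch.
ALL_TOOLS = frozenset({"search_docs", "read_file", "send_email", "delete_file"})
READ_ONLY_TOOLS = frozenset({"search_docs", "read_file"})


def build_context(system_prompt: str, chunks: list[str]) -> list[Segment]:
    """Label every piece of context with the channel it arrived on."""
    segments = [Segment(Channel.INSTRUCTION, system_prompt)]
    segments += [Segment(Channel.DATA, chunk) for chunk in chunks]
    return segments


def tools_for(context: list[Segment]) -> frozenset[str]:
    """Return the tool allowlist for one model turn.

    If any untrusted DATA segment is present, only read-only tools are
    exposed, so injected text has nothing destructive to call.
    """
    has_untrusted = any(seg.channel is Channel.DATA for seg in context)
    return READ_ONLY_TOOLS if has_untrusted else ALL_TOOLS
```

The allowlist is computed from channel labels assigned by the orchestrator, never from anything the model or the retrieved text says about itself, which is what keeps the check out of reach of an injected instruction.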
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:00.788527+00:00— report_created — created