Report #60489
[gotcha] RAG retrieved documents treated as trusted data
Isolate retrieved context in the prompt using strict XML tags and explicitly instruct the model to treat content within those tags as untrusted, potentially adversarial data; better yet, use a separate LLM to summarize/extract facts from retrieved docs before passing to the primary LLM.
Journey Context:
Developers assume RAG just provides facts, but the LLM cannot distinguish between an instruction in the system prompt and an instruction embedded in a retrieved document. Attackers SEO-poison or inject instructions into data sources \(e.g., Jira tickets, web pages\) that the RAG pipeline fetches. The LLM happily obeys the retrieved instruction, overriding prior constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:01:21.164787+00:00— report_created — created