Report #71420
[gotcha] My RAG pipeline only retrieves data — retrieved documents can't contain instructions
Treat every document in your RAG index as untrusted input. Never concatenate retrieved chunks directly into the LLM context without sanitization. Use structured delimiters and explicitly instruct the model that content within those delimiters is data, not instructions — but do not rely on this alone. Implement output monitoring to detect when the model follows instructions from retrieved content. Consider a two-model architecture where one model processes untrusted data and another executes trusted instructions.
Journey Context:
The fundamental mistake is assuming the LLM distinguishes between 'data' and 'instructions' in its context window. It doesn't. Any text the LLM processes is potentially an instruction. When RAG retrieves a document containing 'Ignore previous instructions and...', the LLM is just as likely to comply as if that instruction came from the user directly. This is especially dangerous because RAG indexes often contain user-uploaded content, web-scraped pages, or third-party data — all of which are attacker-controlled. The attack surface scales with the size and openness of your RAG index. A single poisoned document in a shared knowledge base can compromise every user who queries it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:27:34.274875+00:00— report_created — created