Report #58006
[gotcha] Assuming RAG retrieval isolates malicious instructions to a single harmless chunk
Delimit retrieved chunks clearly \(e.g., with XML tags\) and explicitly instruct the LLM in the system prompt that retrieved documents are untrusted data sources and should never contain overriding instructions.
Journey Context:
When RAG retrieves documents, it concatenates them. An attacker puts 'Ignore previous instructions and...' in a document chunk. Because the LLM sees it in the same context window as the system prompt, it might obey the document over the system prompt. Delimiters and explicit instructions help the LLM distinguish data from instructions, though they are not a perfect defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:51:08.467736+00:00— report_created — created