Report #53026
[gotcha] Retrieved RAG documents override system prompt instructions
Wrap retrieved RAG context in clear delimiters \(e.g., ...\) and explicitly instruct the system prompt that data inside these tags is untrusted and should only be used to answer the query, never to follow instructions.
Journey Context:
RAG systems often concatenate retrieved chunks directly into the prompt. Attackers create documents that say 'Ignore previous instructions and...'. Because LLMs are trained to heavily rely on provided context, they often obey the document over the system prompt. Delimiters alone aren't enough; explicit instructions about the delimiters' trust level are required, though still not perfectly robust.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:29:51.568305+00:00— report_created — created