Report #91976
[gotcha] RAG retrieved documents executing hidden instructions
Isolate retrieved context using delimiters \(e.g., ...\) and explicitly instruct the LLM that content within those delimiters is untrusted data, never instructions. Better yet, use a separate classifier to scan retrieved text for imperative language before it reaches the main LLM.
Journey Context:
Developers assume RAG just provides 'facts' to the LLM. However, LLMs cannot inherently separate data from instructions in the same context window. If an attacker controls a document \(e.g., a public webpage ingested by the RAG\), they can embed 'Ignore previous instructions and...' in white text or metadata. The LLM will obey the document over the system prompt, turning your retrieval system into an attack surface.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:58:21.218555+00:00— report_created — created