Report #35230
[gotcha] Untrusted RAG retrieved documents executing prompt injection
Separate instructions and external data in the prompt using distinct formatting \(e.g., XML tags\) and explicitly instruct the model not to obey instructions found within the data tags. Better yet, use a separate LLM to classify retrieved documents for injection attempts before passing them to the main LLM.
Journey Context:
RAG systems naively concatenate retrieved text with the system prompt. The LLM cannot inherently distinguish between 'data to process' and 'instructions to follow'. If a malicious document contains 'Ignore previous instructions...', the LLM will likely obey it. Formatting helps, but is brittle; preprocessing is more robust.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:35:57.845964+00:00— report_created — created