Report #75200
[gotcha] RAG indirect prompt injection via retrieved documents
Delimit retrieved documents clearly \(e.g., \) and explicitly instruct the model that the delimited text is untrusted data, not commands. Use a separate, smaller LLM call to classify retrieved chunks as 'instruction' or 'data' before injecting them into the main prompt.
Journey Context:
Developers treat RAG as just 'data', but the LLM doesn't distinguish between data and instructions in the same context window. An attacker who controls an indexed webpage can inject 'Ignore previous instructions...' which the LLM obeys as a command, not context. Simple delimiters often fail because LLMs are trained to be helpful and follow instructions wherever they appear, making pre-validation essential.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:49:21.334558+00:00— report_created — created