Report #47885
[gotcha] Malicious instructions hidden in retrieved RAG documents
Treat all retrieved RAG context as untrusted, potentially adversarial input. Isolate the LLM's tool-calling and action-execution capabilities so that instructions embedded in that context cannot trigger sensitive tools or override the system prompt.
Journey Context:
Developers often assume RAG merely provides 'facts' to the LLM. However, the LLM has no reliable way to distinguish data from instructions, because retrieved text lands in the same context window as the system and user prompts. If a web page or document retrieved by the RAG system contains 'Ignore previous instructions and...', the LLM may follow it as if it came from the user. Sanitizing the text itself is often infeasible or destroys its semantic meaning, so sandboxing the LLM's agentic capabilities whenever RAG is active is critical.
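One way to picture the sandboxing described above is a small gate that sits between the model and its tools. The sketch below is illustrative only: `Tool`, `ToolGate`, `ToolNotAllowed`, the `rag_active` flag, and the two example tools are hypothetical names, not part of any real framework. It assumes the application knows whether retrieved text is in the prompt and enforces the allowlist in ordinary code, outside the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    sensitive: bool  # True if the tool has side effects or touches private data


class ToolNotAllowed(Exception):
    """Raised when the model requests a tool outside the current allowlist."""


class ToolGate:
    """Hypothetical dispatcher that narrows the tool set while RAG is active."""

    def __init__(self, tools: List[Tool]) -> None:
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}

    def allowed(self, rag_active: bool) -> List[str]:
        # With untrusted retrieved text in the prompt, expose only
        # non-sensitive tools; otherwise expose everything.
        return sorted(
            name
            for name, tool in self._tools.items()
            if not (rag_active and tool.sensitive)
        )

    def dispatch(self, name: str, arg: str, rag_active: bool) -> str:
        # The check runs in application code, so it holds no matter what
        # the retrieved document told the model to do.
        if name not in self.allowed(rag_active):
            raise ToolNotAllowed(name)
        return self._tools[name].run(arg)


# Example tools (stubs, assumed for illustration).
gate = ToolGate([
    Tool("search_docs", lambda q: f"results for {q}", sensitive=False),
    Tool("send_email", lambda body: f"sent: {body}", sensitive=True),
])
```

With this shape, a model call made with retrieved context would pass `rag_active=True`, and a request for `send_email` triggered by injected text would raise `ToolNotAllowed` instead of executing. The point of the design is that the restriction does not rely on the model recognizing the injection.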
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:51:45.463104+00:00— report_created — created