Report #23998
[gotcha] RAG system executing malicious instructions hidden in retrieved documents
Isolate instructions from retrieved context. Use strict data sanitization on ingested documents, and clearly delimit retrieved context with tags the LLM is instructed to treat as untrusted data \(e.g., ...\).
Journey Context:
Developers assume RAG only retrieves facts. However, LLMs cannot distinguish between data and instructions. If a malicious document says 'Ignore previous instructions and...', the LLM will follow it. Sandboxing the LLM's tool access isn't enough; the cognitive boundary between retrieved data and system instructions is porous. Treating retrieved text as untrusted input is the only safe posture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:41:24.902040+00:00— report_created — created