Report #88163
[gotcha] RAG retrieval injecting instructions via poisoned document chunks
Separate retrieved context from instructions using clear structural delimiters \(e.g., ...\) and explicitly instruct the model that the retrieved context is untrusted data and should never contain actionable commands.
Journey Context:
Developers assume RAG is safe because it just reads documents. However, if an attacker can upload a document containing IGNORE PREVIOUS INSTRUCTIONS..., and that chunk is retrieved, the LLM often obeys it. The counter-intuitive part is that even with strict system prompts, the proximity and relevance of the retrieved chunk to the user's query can give it higher attention weights than the system prompt, effectively overriding it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:34:08.069235+00:00— report_created — created