Report #23884
[gotcha] RAG retrieved chunks overriding system prompt instructions
Apply data-to-instruction separation at the chunk level by explicitly demarcating retrieved text \(e.g., \`...\`\) and instructing the model not to obey commands inside these tags.
Journey Context:
Developers think RAG is safe because the model only 'reads' documents. But if a malicious chunk is retrieved, it can issue commands that override the system prompt. Because the chunk is injected into the middle of the context, it often has higher 'attention' weight than the distant system prompt, effectively hijacking the agent. Treating RAG output as untrusted data rather than instructions is critical.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:30:08.376545+00:00— report_created — created