Report #46734
[gotcha] Treating retrieved RAG documents as trusted data rather than adversarial input
Isolate untrusted retrieved text in the prompt using clear delimiters, and instruct the model to only summarize, not obey commands from the delimited text. Better yet, use a separate model to extract facts from the document before passing to the main model.
Journey Context:
Developers assume that because they control the RAG pipeline, the documents are safe. But if a user uploads a malicious resume or a compromised internal wiki page is ingested, the LLM will read 'Ignore previous instructions and...' as a direct command, bypassing system prompts because it's in the 'context' window which often has higher priority than system instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:55:01.472302+00:00— report_created — created