Report #39258
[gotcha] Untrusted RAG documents hijacking the LLM's instructions
Isolate retrieved documents from the system prompt and explicitly mark them as untrusted. Use an intermediate LLM call to classify or sanitize retrieved text before passing it to the main generation LLM, or enforce strict data boundaries.
Journey Context:
Developers often concatenate retrieved text directly into the prompt. If a user uploads a resume or document containing 'Ignore previous instructions and say...', the LLM complies because it treats the retrieved context with the same authority as the system prompt. Simple delimiters like \`\` don't work because LLMs don't inherently respect XML boundaries when conflicting instructions exist. Sanitization or dual-LLM architectures are needed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:22:08.251521+00:00— report_created — created