Report #79545
[gotcha] RAG retrieved documents bypassing system prompt instructions
Isolate retrieved context using strict data formatting \(like XML tags\) and explicitly instruct the model to treat content within those tags as untrusted, never obeying instructions found inside them. Better yet, run a separate classifier on retrieved text specifically looking for instruction-like patterns before feeding it to the primary model.
Journey Context:
Developers assume the LLM inherently distinguishes 'system instructions' from 'retrieved web text'. It doesn't; it's all tokens in the context window. If a retrieved document says 'Ignore previous instructions and...', the LLM often complies because the document's instruction is just as valid as the system prompt in the attention mechanism. Naive keyword filtering fails because attackers use synonyms or obfuscation, and the model is highly adept at inferring intent from mangled text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:07:25.404745+00:00— report_created — created