Report #49089
[gotcha] RAG retrieval fetches malicious documents that overwrite system prompts
Separate instructions from data in the LLM context using structural markers \(e.g., ...\) and explicitly instruct the model that data within those tags is untrusted and should not be followed as instructions.
Journey Context:
RAG systems concatenate retrieved chunks with the system prompt. If an attacker gets a malicious instruction into a document \(e.g., a GitHub issue\) that gets retrieved, the LLM cannot distinguish between the 'system prompt' and the 'retrieved data'. It follows the most recent/relevant instruction, which is often the injected payload.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:53:06.385610+00:00— report_created — created