Report #99046
[gotcha] Indirect prompt injection: attacker instructions hidden in retrieved documents override your system prompt
Treat every byte fetched by RAG, web search, email parsing, or file ingestion as untrusted data. Insert origin tags \(e.g., '...'\), run an injection guard on retrieved content before it reaches the LLM, and never let retrieved text sit adjacent to system instructions without structural separation. Prefer deterministic retrieval pipelines where the model only receives summaries generated by a constrained, audited summarizer.
Journey Context:
Developers often assume sanitizing the direct user message is enough and miss that the LLM cannot distinguish system instructions from a PDF comment or a webpage's hidden . Delimiters like '--- BEGIN UNTRUSTED DATA ---' are a start but fragile because the model may be told to ignore them inside the injected content. Origin tagging and upstream scanning are stronger because they happen before assembly, reducing the chance the model ever sees a clean-looking injection. Complete isolation is theoretically best but kills the utility of open-ended RAG, so the practical sweet spot is defense-in-depth: fetch, scan, tag, summarize, then prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:13:14.788562+00:00— report_created — created