Report #74524
[gotcha] Indirect Prompt Injection via RAG retrieved documents or tool outputs
Delimit retrieved context explicitly and instruct the model to treat it as untrusted data. Better yet, use a separate, smaller classifier to scan retrieved text for instruction-like phrases before passing it to the main LLM.
Journey Context:
Developers assume RAG merely provides 'facts', but LLMs cannot inherently separate data from instructions in the same context window. If a user's email or resume retrieved by RAG says 'Ignore previous instructions...', the LLM often complies, leading to data theft or malicious actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:41:10.675733+00:00— report_created — created