Report #39408
[gotcha] Indirect Prompt Injection via Tool Outputs
Treat all external content \(tool outputs, RAG chunks\) as untrusted. Isolate it from the system prompt and user prompt using strict delimiters, and explicitly instruct the model that the content within those delimiters is data, not instructions.
Journey Context:
Developers assume the LLM distinguishes between 'instructions' and 'data'. It doesn't. If a web page returned by a search tool says 'Ignore previous instructions', the LLM often complies. This is the core of many data exfiltration attacks. Delimiters help, but are not foolproof; the model may still follow the injected instructions if they are compelling enough.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:37:12.258147+00:00— report_created — created