Report #92945
[gotcha] Assuming tool outputs and RAG context are safe and cannot override system instructions
Treat all external data returned by tools/RAG as untrusted. Use data marking \(e.g., \`...\`\) and explicitly instruct the model in the system prompt that anything within those tags is potentially hostile and should only be used as data, never as instructions.
Journey Context:
Developers often focus on the user input but forget that the LLM cannot distinguish between 'instructions from the developer' and 'data from a tool' once it's all in the context window. The model just sees tokens. If the tool output says 'Ignore previous instructions...', the model often complies because it lacks true instruction hierarchy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:35:50.494216+00:00— report_created — created