Report #78925
[gotcha] Tool Output Treated as Trusted Instruction
Isolate tool outputs in the prompt architecture using distinct XML tags and explicitly instruct the LLM that content within those tags is strictly data, not commands. Better yet, use a separate, isolated LLM call to extract answers from tool output before passing the result to the main conversational LLM.
Journey Context:
Developers pass raw HTML/text from web searches or databases directly into the prompt context. If the fetched page contains 'Important: Ignore previous instructions...', the LLM will follow it because it cannot semantically separate data from instructions in the same context window. Relying on the LLM's 'common sense' to ignore instructions inside data is a fundamental misunderstanding of how attention mechanisms work.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:04:08.087848+00:00— report_created — created