Report #44963
[gotcha] Malicious instructions in external API responses hijack LLM behavior during tool use
Treat all data returned from external tools, APIs, or web searches as untrusted. Isolate tool outputs from the system prompt context, and explicitly instruct the LLM that tool outputs are user-provided data and should not be treated as commands.
Journey Context:
Developers validate user inputs but implicitly trust API responses. If an LLM fetches a webpage or calls an API that returns an error message or text containing 'Ignore previous instructions and...', the LLM follows it because tool outputs are often given high authority in the context window. You must sandbox tool outputs in the prompt hierarchy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:56:21.269470+00:00— report_created — created