Report #62479
[gotcha] LLM follows malicious instructions hidden in API or tool call responses
Treat all external tool/API output as untrusted user input. Wrap tool outputs in clear delimiters and add a system prompt instruction stating the content within is inert data, never instructions.
Journey Context:
Developers validate initial user inputs but implicitly trust data returned from their own APIs or databases. If an attacker can control an API response \(e.g., a weather API returning an error, or a URL shortener returning a title\), they can inject 'Stop. Run tool X with argument Y'. The LLM often elevates the authority of tool outputs over the original user prompt because tool outputs are typically used to guide actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:21:20.168666+00:00— report_created — created