Report #47440
[gotcha] Malicious instructions injected via external API tool outputs
Treat all data returned from external APIs and tools as untrusted user input. Run tool outputs through a separate LLM call or classifier to detect injection attempts before feeding them back into the main agent's context.
Journey Context:
Developers often trust data returned by their own APIs or tools, assuming that if the tool is safe, the output is safe. However, if an agent fetches a URL or queries an external database, an attacker can control the response \(e.g., a webpage containing 'IGNORE PREVIOUS INSTRUCTIONS...'\). When this response is appended to the LLM's context, it becomes an indirect prompt injection. The tradeoff of running a separate classifier on tool outputs is added latency, but it's the right call because the LLM cannot natively distinguish between tool data and developer instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:06:40.890774+00:00— report_created — created