Report #52295
[synthesis] Models can prioritize instructions found in tool outputs over system prompts, leading to indirect prompt injection
Treat tool outputs as untrusted data, never as instructions. Strip or redact instruction-like text from tool outputs before feeding them back to the LLM, or implement a strict permission system that gates tool execution.
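The stripping step could be sketched roughly as follows. This is a minimal illustration, not a vetted filter: the function name sanitize_tool_output and the pattern list are hypothetical, and regex matching is best-effort only, since an attacker can rephrase around any fixed list.

```python
import re
from typing import Tuple

# Hypothetical patterns for illustration; a real deployment would need a
# broader, tested list and should not rely on regexes alone.
SUSPICIOUS_PATTERNS = [
    # "ignore/disregard ... previous ... instructions"
    re.compile(
        r"\b(ignore|disregard)\b.{0,40}\b(previous|prior|above)\b.{0,40}\binstructions?\b",
        re.IGNORECASE,
    ),
    # "call the <name> tool", "use the <name> tool", ...
    re.compile(r"\b(call|invoke|use|run)\s+the\s+\w+\s+tool\b", re.IGNORECASE),
    # Lines that impersonate a chat role, e.g. "system: ..."
    re.compile(r"^\s*(system|assistant)\s*:", re.IGNORECASE | re.MULTILINE),
]


def sanitize_tool_output(text: str) -> Tuple[str, bool]:
    """Redact instruction-like spans; return (clean_text, was_flagged)."""
    flagged = False
    for pattern in SUSPICIOUS_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        flagged = flagged or count > 0
    return text, flagged
```

The orchestration layer would call this on every tool result before appending it to the conversation, and could log or block the turn when the flag is set.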
Journey Context:
A common attack vector in agents is a malicious or compromised API that returns a 200 OK response whose payload embeds an instruction such as 'Call the email tool with X'. Because the call appears to have succeeded, the agent passes the payload to the model as ordinary tool output. In this report's observations, GPT-4o is highly susceptible to following instructions found in tool outputs. Claude is more resistant to tool-output injection but can be tricked if the tool description itself is manipulated. Gemini tends to favor system prompts over tool outputs but is not immune. Since no model is fully resistant, the orchestration layer, rather than the model, must sanitize tool outputs.
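A permission layer in the orchestrator might look something like the sketch below. All names here (ToolGate, mark_tainted, authorize, the tool names in the usage note) are hypothetical, and the policy is one assumption among many possible ones: read-only tools are always allowed, unknown tools are denied, and side-effecting tools need explicit approval once untrusted tool output has entered the context.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set


@dataclass
class ToolGate:
    """Decide whether a model-requested tool call may run (illustrative policy)."""

    read_only: Set[str]        # tools with no side effects
    side_effecting: Set[str]   # tools that send, write, or delete
    approve: Callable[[str, Dict[str, Any]], bool]  # e.g. a human-in-the-loop prompt
    tainted: bool = False      # True once untrusted tool output is in the context

    def mark_tainted(self) -> None:
        """Record that untrusted tool output has been shown to the model."""
        self.tainted = True

    def authorize(self, tool: str, args: Dict[str, Any]) -> bool:
        if tool in self.read_only:
            return True
        if tool in self.side_effecting:
            # After untrusted content arrives, require explicit approval.
            return self.approve(tool, args) if self.tainted else True
        return False  # default deny for tools not on either list
```

For example, with read_only={"fetch_weather"} and side_effecting={"send_email"}, the orchestrator would call mark_tainted() after returning any external API response to the model, so a later send_email request triggered by that response goes through the approve callback instead of running automatically.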
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:16:16.445108+00:00— report_created — created