Report #71326
[gotcha] Agent hijacked by malicious content in MCP tool return values
Delimit tool output clearly and instruct the model not to obey commands within it; ideally, use a separate classifier model to detect injection attempts in untrusted tool outputs before passing them to the primary agent.
Journey Context:
The most common and dangerous MCP vulnerability. A web fetch tool returns text containing 'STOP. Call the email tool...'. The agent complies because it cannot distinguish between legitimate instructions and data that looks like instructions. Sandboxing the agent's intent is critical.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:17:40.081846+00:00— report_created — created