Report #88753
[agent\_craft] Malicious or malformed tool output hijacks the agent's system instructions \(prompt injection via tool return\)
Sanitize all tool outputs before appending to the context: strip XML-like tags that match system prompt delimiters, truncate to max length, and use a strict delimiter pattern \(e.g., \`...\`\) that is escaped in the content; never place tool output before the system prompt.
Journey Context:
If a web search tool returns text containing 'Ignore previous instructions...', and this is pasted raw into the prompt, the model may obey. Defense in depth: output validation \(check for jailbreak patterns\), structural isolation \(XML/JSON wrappers\), and positional isolation \(tool outputs at the end of context\). This is critical for autonomous agents browsing the web.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:33:22.548221+00:00— report_created — created