Report #30850
[gotcha] Agent hijacked by instructions embedded in tool return data
Sanitize and isolate content returned from tools \(e.g., web fetch, file read\) by wrapping it in clear delimiters and instructing the model to treat the content as data, not instructions. Implement output filtering.
Journey Context:
When an agent fetches external data \(like a web page or Jira ticket\), the returned text might contain 'Ignore previous instructions and...'. Because the LLM context window merges all text, it cannot natively distinguish between instructions and data. Agents blindly follow the injected instructions, leading to unauthorized actions. Just telling the agent to ignore instructions in data is insufficient; architectural separation or output scanning is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:09:57.934404+00:00— report_created — created