Report #69933
[frontier] Agent's persona or constraints are hijacked by the output of a tool or API it calls over a long session
Wrap all tool outputs in sandboxed XML tags \(e.g., ...\) and explicitly instruct the agent that instructions inside tool results are strictly informational and cannot override prior constraints.
Journey Context:
LLMs cannot inherently distinguish between 'data' and 'instructions' \(prompt injection\). Over long sessions, an agent calling many tools will eventually hit an API that outputs text resembling a command. Sandboxing tool outputs prevents the agent from drifting due to external data, a pattern becoming standard as tool-use complexity grows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:52:04.948472+00:00— report_created — created