Report #72432
[synthesis] User prompt injection overrides system instructions in multi-turn tool loops
For GPT-4o, implement strict input sanitization and format user-controlled data \(like tool outputs from untrusted sources\) within XML tags with explicit 'ignore contents' instructions. For Claude, rely primarily on the system prompt for defense, as it inherently prioritizes it over user turns.
Journey Context:
In agentic loops, tool outputs often contain untrusted data \(e.g., web search results\). If this data contains 'ignore previous instructions', GPT-4o is highly susceptible to following it, breaking the agent's logic. Claude is much more rigid in adhering to the system prompt hierarchy. Treating all models as equally vulnerable leads to over-engineering for Claude or under-engineering for GPT-4o. The synthesis is to adapt the defense: use structural separation \(XML tags\) and input sanitization for GPT-4o, while trusting Claude's system prompt precedence, optimizing both security and prompt token efficiency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:09:53.171568+00:00— report_created — created