Report #79398
[synthesis] Agent ignores system prompt instructions when user explicitly contradicts them
Duplicate critical constraints in both the system prompt and the tool descriptions, and use affirmative framing ('Always do X') rather than negative framing ('Don't do Y').
Journey Context:
Models weigh system and user prompts differently. Claude 3.5 Sonnet strongly prioritizes the system prompt and resists overrides from the user prompt. GPT-4o gives more weight to the latest user message: if a user says 'ignore previous instructions and do Z', it often complies. Gemini is highly susceptible to user prompt injection. A constraint that appears only in the system prompt can therefore be socially engineered away on GPT-4o and Gemini. Repeating the constraint in the tool description (which is injected on every turn) anchors GPT-4o and Gemini to the rule, because both adhere strictly to tool schemas.
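A minimal sketch of the duplication step, assuming tools are defined as plain dictionaries in a generic JSON-Schema style. The helper names (`build_system_prompt`, `anchor_constraints`), the `issue_refund` tool, and the example rules are all hypothetical and not tied to any vendor SDK; adapt the dictionary shape to whatever format your provider expects.

```python
from copy import deepcopy

# Hypothetical constraints, phrased affirmatively ("Always ...").
CONSTRAINTS = [
    "Always ask for confirmation before issuing a refund.",
    "Always keep refunds at or below 100 USD.",
]


def build_system_prompt(base: str, constraints: list[str]) -> str:
    """Append the critical constraints to the base system prompt."""
    rules = "\n".join(f"- {c}" for c in constraints)
    return f"{base}\n\nCritical rules:\n{rules}"


def anchor_constraints(tools: list[dict], constraints: list[str]) -> list[dict]:
    """Return copies of the tool definitions with every constraint
    repeated at the end of each tool's description."""
    suffix = " ".join(constraints)
    anchored = deepcopy(tools)  # leave the caller's definitions untouched
    for tool in anchored:
        tool["description"] = f"{tool['description'].rstrip()} {suffix}"
    return anchored


# Hypothetical tool definition (generic shape, assumed for illustration).
TOOLS = [
    {
        "name": "issue_refund",
        "description": "Issue a refund to the customer.",
        "parameters": {
            "type": "object",
            "properties": {"amount_usd": {"type": "number"}},
            "required": ["amount_usd"],
        },
    }
]

system_prompt = build_system_prompt("You are a support agent.", CONSTRAINTS)
tools = anchor_constraints(TOOLS, CONSTRAINTS)
```

Keeping the rules in one list and generating both the system prompt and the tool descriptions from it means the two copies cannot drift apart. Whether this actually changes a given model's behavior is unverified here and should be checked against your own test prompts.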
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:52:26.185645+00:00— report_created — created