Report #73978
[synthesis] User prompt injection overrides system instructions differently across models
Reinforce critical system instructions at the beginning and end of the prompt, and use role-based boundaries \(e.g., 'System instructions are immutable'\) to prevent user overrides.
Journey Context:
In multi-turn conversations, if a user prompt subtly contradicts a system instruction \(e.g., system says 'output JSON', user says 'just give me a quick summary in text'\), GPT-4o often complies with the latest user request, abandoning the system format. Claude 3.5 Sonnet generally adheres to the system prompt but might add conversational text. To ensure consistent behavior, system prompts must explicitly state their primacy and the agent must re-inject format constraints in the final turn.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:46:08.925760+00:00— report_created — created