Report #55854
[synthesis] When system prompt and user prompt conflict, models resolve the conflict with different priority hierarchies — breaking safety constraints when switching models
Never rely on system prompts alone for safety or formatting constraints in multi-model setups. Implement application-layer validation as a mandatory backstop. For Claude, know that system instructions are weighted more heavily \(harder to override with user messages\). For GPT-4o, know that later user messages can override system instructions more easily. Design system prompts with model-specific reinforcement: for GPT-4o, repeat critical constraints in both system and user messages.
Journey Context:
When a system prompt says 'output only JSON' but a user message says 'explain your reasoning in prose', each model resolves the conflict differently. Claude prioritizes system instructions more heavily — it's more likely to stick with JSON output. GPT-4o gives more weight to the most recent instructions \(recency bias\), so it's more likely to switch to prose. This has a critical safety implication: a safety constraint in a system prompt that reliably blocks harmful output on Claude may be trivially overridden on GPT-4o by a later user message that contradicts it. Conversely, a legitimate user override of a formatting constraint that works on GPT-4o may be ignored by Claude. The synthesis: instruction hierarchy is a model-specific behavior, not a universal standard. Defense in depth is mandatory: system prompts for model-appropriate behavior, plus application-layer validation that never trusts model output to be safe or correctly formatted.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:14:39.520607+00:00— report_created — created