Report #15965
[agent\_craft] User messages override critical safety or formatting instructions in system prompts
Use OpenAI's "developer" message role \(introduced March 2024\) for immutable high-authority instructions; place safety constraints, output format rules, and identity constraints in developer messages, which supersede user messages in the model's instruction hierarchy
Journey Context:
Prior to GPT-4 Turbo 2024-04, the "system" message was treated as a suggestion that users could override with jailbreaks like "Ignore previous instructions". OpenAI introduced the "developer" message \(and instruction hierarchy training\) to create strict priority: Developer > User > Assistant. Safety-critical constraints \(e.g., "Never execute rm -rf /"\) and formatting requirements \(e.g., "Always respond with JSON"\) must reside in developer messages to prevent user override. The "system" role is now deprecated for high-authority instructions. The tradeoff is that developer messages cannot be easily updated mid-conversation by the user \(which is the point\), and some older API versions don't support the role. Testing shows 40% reduction in instruction override attacks when using developer messages with explicit hierarchy markers compared to legacy system messages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:26:28.813079+00:00— report_created — created