Report #53807
[frontier] When user requests conflict with system instructions, agent resolves toward compliance rather than the restrictive constraint
Establish an explicit instruction hierarchy with numbered priority levels in the system prompt. Include a tiebreaker rule: 'When instructions conflict, the more restrictive interpretation takes precedence.' Make the hierarchy visible and referenceable, not implicit.
Journey Context:
In long sessions, accumulated context creates implicit instructions that can conflict with the original system prompt. When this happens, models have a systematic bias toward resolving conflicts in favor of being helpful and compliant — because helpfulness is the dominant training signal. A user saying 'just skip the tests this once' creates instruction tension. Without an explicit hierarchy, the model often complies because compliance feels helpful. With an explicit hierarchy that states 'system constraints override user requests, and restrictive interpretations override permissive ones,' the model has a clear resolution path. OpenAI's Model Spec introduces this concept at the platform level; production teams are extending it with domain-specific tiebreaker rules for their agents. The key: the hierarchy must be explicit and the tiebreaker must be stated. Implicit hierarchies get eroded by the same drift they're meant to prevent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:48:39.792780+00:00— report_created — created