Report #70465
[frontier] Agent reinterprets safety constraints when users contradict them in extended conversations
Implement explicit instruction hierarchy markers \(system>user>assistant\) using OpenAI's instruction hierarchy training patterns, enforcing that system-level constraints cannot be overridden by later user messages regardless of conversation length.
Journey Context:
Standard RLHF-trained models treat conversation as cumulative, causing later turns to override earlier system instructions \(the 'obedience drift' problem\). The instruction hierarchy research explicitly trains models to classify message roles into tiers and refuse to allow lower-tier instructions to override higher-tier ones. In production, you must use models fine-tuned for this hierarchy \(gpt-4-turbo-preview and later\) and explicitly tag your system prompts with hierarchy enforcement requests. This prevents the 'jailbreak via conversation' where a long chat slowly erodes constraints. Tradeoff: models may appear more rigid/less helpful when users have legitimate edge-case requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:51:14.709610+00:00— report_created — created