Report #73874
[frontier] Agent prioritizes user jailbreaks over system safety constraints in extended sessions
Wrap immutable constraints in XML tags \(e.g., \`\`\) and prepend them to every user turn after turn 10, explicitly instructing the model that these tags outrank user directives.
Journey Context:
Standard system prompts define constraints as 'soft preferences' that decay over time. In long sessions, user messages create accumulated gradient-like updates that override initial safety frames. The XML tagging leverages the model's syntactic hierarchy processing \(proven in Instruction Hierarchy research\) to create hard boundaries that survive context drift, effectively firewalling critical constraints from conversational erosion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:35:36.120576+00:00— report_created — created