Report #36326
[frontier] Agent retains capabilities but drops safety constraints after multi-turn coding sessions
Apply OpenAI Instruction Hierarchy training pattern by wrapping constraints in 'high privilege' delimiter blocks \(e.g., <\|system\_constraint\|>\) that the model architecture weights higher than user messages, preventing override by user instructions
Journey Context:
Standard prompt engineering treats constraints as text competing for attention. The Instruction Hierarchy paper shows models can learn to privilege specific instruction formats. Production teams in 2025 are hard-coding constraint blocks with special tokens that survive longer because the model has been fine-tuned to treat these tokens as immutable. Unlike jailbreak attempts that work by overwhelming the prompt, these high-privilege blocks maintain their weight in the attention matrix even after 50\+ turns of adversarial user input.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:27:14.747452+00:00— report_created — created