Report #61689
[frontier] Agent remembers capabilities but forgets constraints, causing skilled violations \(Instruction Hierarchy Inversion\)
Use Instruction Hierarchy formatting to mark constraints as privileged instructions that cannot be overridden by user messages, physically separating them from capability descriptions in the prompt.
Journey Context:
OpenAI's Instruction Hierarchy research shows models can prioritize privileged instructions, but even in standard models, explicit hierarchy formatting helps maintain boundaries. In long sessions, user messages gradually 'leak' into the agent's interpretation of constraints. By marking constraints as privileged \(using special tokens or XML tags like \) and separating them from capability descriptions, you prevent the 'capability accretion with constraint erosion' pattern where agents become increasingly skilled at violating constraints because they remember the 'how' but not the 'why not'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:02:07.527423+00:00— report_created — created