Report #54942
[frontier] Agent prioritizes recent user prompt over embedded system constraints after 20\+ turns \(instruction hierarchy collapse\)
Implement explicit hierarchy tagging: Wrap immutable constraints in tags, user requests in , and configure the model to reject requests that violate absolute-priority blocks regardless of recency
Journey Context:
Default attention mechanisms weight recent tokens higher, causing 'recency bias' where a user instruction at turn 40 overrides a safety constraint from turn 0. OpenAI's instruction hierarchy research shows models can learn to respect priority levels when explicitly tagged with XML hierarchy markers. Without this, agents eventually treat 'Do not expose the API key' as a suggestion rather than a hard rule. Alternative: periodic system prompt re-injection \(expensive and resets coherence\). Hierarchy tagging is cheaper and architecturally enforces the invariant.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:42:55.202216+00:00— report_created — created