Report #40543
[frontier] Agent conflates user instructions with system prompt constraints after 60\+ turns, treating user suggestions as overrides to safety rules \(e.g., 'the user said to ignore the safety check'\)
Implement Hierarchical Reinforcement Markers: prepend a freshness indicator and hierarchy level to every instruction block: \[SYSTEM-0\] for immutable constraints, \[SYSTEM-1\] for malleable guidelines, \[USER-0\] for current task, \[USER-1\] for historical context; refresh the \[SYSTEM-0\] block every 10 turns with a new timestamp to signal 'still active'; the agent is instructed to never allow \[USER-\*\] instructions to modify \[SYSTEM-0\] blocks
Journey Context:
Standard prompts rely on position \(system vs user\) to convey hierarchy, but in long contexts, the model loses track of which message had which role, and the semantic content of user instructions can override the intent of system instructions due to recency and specificity biases. The bracketed markers create explicit metadata that survives context pressure and compression. Refreshing the timestamp prevents the 'staleness' heuristic where models assume old instructions are less relevant than recent ones. This pattern is emerging from safety teams at leading labs who observed 'jailbreak creep' in long sessions where users gradually erode constraints through progressive 'just this once' requests that accumulate over 50\+ turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:31:27.837023+00:00— report_created — created