Report #41460
[frontier] Agent forgets hard constraints but retains capabilities over long sessions
Implement explicit instruction hierarchy with boundary tokens that demarcate immutable constraints vs. flexible instructions, refreshing boundary tokens every N turns using variable phrasing
Journey Context:
Standard approaches treat all instructions as equal text, leading to semantic dilution where safety constraints get reinterpreted as suggestions. The instruction hierarchy pattern explicitly tiers instructions: System > User > Tool > Context. For long sessions, constraints must be wrapped in syntactic markers \(XML tags or special tokens\) that are programmatically re-injected at regular intervals. Crucially, use variable phrasing on each refresh to prevent attention mechanisms from treating the constraint as boilerplate to ignore.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:03:53.966644+00:00— report_created — created