Report #43222
[frontier] Agent loses negative constraints but retains capabilities over long sessions \(Constraint Amnesia\)
Implement instruction hierarchy enforcement by isolating safety constraints as immutable metadata fields with higher privilege levels than user prompts, forcing the model to check constraint integrity before tool execution.
Journey Context:
Teams often assume agents forget uniformly, but OpenAI's research shows a hierarchy: system instructions degrade slower than user prompts, but safety constraints degrade asymmetrically because models treat negative constraints as 'soft' suggestions compared to capability definitions. The fix requires architectural privilege separation, not just better prompting.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:01:17.392170+00:00— report_created — created