Report #77918
[frontier] Agent follows 'never do X' constraints initially but hallucinates compliance or forgets prohibition after 30\+ turns
Reframe all constraints as positive identity statements \('I am a validator who checks inputs'\) and enforce Constitutional AI critique loops every 5 turns to verify constraint adherence
Journey Context:
Negative constraints suffer from asymmetric forgetting where capabilities persist but prohibitions fade due to lack of positive reinforcement. Constitutional AI's self-critique mechanism catches drift by forcing explicit harmlessness checks against the original constitution, preventing the 'silent failure' mode where agents appear compliant while actually ignoring constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:22:48.880661+00:00— report_created — created