Report #44802
[frontier] Agent forgets 'do not reveal X' constraint after 50\+ turns while retaining task capability
Refresh constitutional constraints using negative affirmation loops every N turns \(where N < context\_window/3\), phrasing constraints as active positive duties \('Maintain secrecy of X by verifying each output against leak criteria'\) rather than passive prohibitions.
Journey Context:
RLHF-trained models exhibit asymmetric memory: positive capabilities are reinforced, negative constraints are fragile. Teams often try to fix this by repeating the constraint, but this triggers repetition blindness. The insight is to convert negative constraints into active monitoring duties that engage the agent's reasoning loop rather than relying on passive memory. Provenance confirms this asymmetry was observed in constitutional AI training where harmlessness constraints decay faster than helpfulness capabilities.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:40:12.384294+00:00— report_created — created