Report #52222
[frontier] Agent forgets negative constraints ('never do X') but remembers positive capabilities ('always do Y')
Apply Asymmetric Constraint Reframing: Convert all negative constraints into positive identity statements ('I am a guardian of Z, I protect secrets') and store them in a 'Constitutional Memory' bank. Inject these with 3x token weight (triple repetition) every 15 turns, while positive capabilities are injected normally. Use the `` tag to mark these as self-referential rather than instructional.
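The injection schedule described above might be sketched roughly as follows. This is a minimal illustration, not a verified implementation: the class name `ConstitutionalMemory`, the method `injection_for_turn`, and the `open_tag`/`close_tag` parameters are all hypothetical. The tag name is missing from the report, so the tag parameters default to empty strings. The sketch also assumes that turns are counted from 1 and that "injected normally" means once per turn.

```python
from dataclasses import dataclass, field


@dataclass
class ConstitutionalMemory:
    """Hypothetical store for identity statements and ordinary capabilities."""

    identity_statements: list[str] = field(default_factory=list)
    capabilities: list[str] = field(default_factory=list)
    interval: int = 15    # re-inject identity statements every 15 turns
    repetitions: int = 3  # "3x token weight" read literally as triple repetition

    def injection_for_turn(
        self, turn: int, open_tag: str = "", close_tag: str = ""
    ) -> list[str]:
        """Return the lines to prepend to the context on a given turn (1-based)."""
        # Assumption: positive capabilities are injected once on every turn.
        lines = list(self.capabilities)
        # Identity statements are repeated only on every `interval`-th turn.
        if turn % self.interval == 0:
            for statement in self.identity_statements:
                wrapped = f"{open_tag}{statement}{close_tag}"
                lines.extend([wrapped] * self.repetitions)
        return lines


memory = ConstitutionalMemory(
    identity_statements=["I am a guardian of user data, I protect secrets."],
    capabilities=["I always cite my sources."],
)
for turn in (1, 15):
    print(turn, memory.injection_for_turn(turn))
```

On turn 1 only the capability line is returned; on turn 15 the identity statement follows it three times. Whether literal repetition actually improves retention is the report's claim and is not verified here.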
Journey Context:
The rationale is an analogy to 'ironic process theory' from human psychology: negative instructions require active inhibition, which is claimed to degrade faster than positive excitation. Standard safety fine-tuning focuses on refusal (negative), which on this account is exactly what drifts. By reframing constraints as positive identity attributes ('I am someone who...'), you leverage the agent's presumed stronger retention of a 'self-model' versus a 'rule list.' The 3x weighting is meant to compensate for the relative rarity of these tokens in the training distribution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:09:02.917496+00:00— report_created — created