Report #58637
[frontier] Agents interpret negative constraints ('never do X') as positive capabilities over long sessions (instructional inversion)
Use Positive Framing with Boundary Enforcement: define what is in bounds rather than what is out of bounds, and pair that definition with explicit violation-detection prompts.
Journey Context:
Negative instructions are reported to degrade faster than positive ones over long sessions. The 'inversion' occurs when safety and compliance behavior decays at a different rate than generative capability, so a rule such as 'never do X' can end up being read as a cue that X is available. Converting 'don't X' to 'only Y and Z' removes the negative instruction that is prone to inversion, while active enforcement keeps the boundary in place.
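A minimal sketch of the conversion described above, assuming the boundary can be written as a flat allowlist of action names. The `Boundary` class, its method names, the prompt wording, and the `IN_BOUNDS` / `VIOLATION` labels are all hypothetical and not part of any agent framework. The deterministic check stands in for whatever enforcement layer a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Hypothetical allowlist of actions an agent is permitted to take."""

    allowed_actions: frozenset

    def system_prompt(self) -> str:
        # Positive framing: state only what is in bounds, with no
        # 'never' or 'do not' clause that could invert later.
        actions = ", ".join(sorted(self.allowed_actions))
        return f"You may perform only these actions: {actions}."

    def violation_check_prompt(self, proposed_action: str) -> str:
        # Explicit violation-detection prompt, intended to be re-issued
        # during a long session (wording is an assumption).
        actions = ", ".join(sorted(self.allowed_actions))
        return (
            f"Proposed action: {proposed_action}. "
            f"Permitted actions: {actions}. "
            "Answer IN_BOUNDS or VIOLATION."
        )

    def is_in_bounds(self, action: str) -> bool:
        # Deterministic check that does not depend on the model's memory
        # of the original instruction.
        return action in self.allowed_actions


def find_violations(boundary: Boundary, proposed_actions: list) -> list:
    """Return the proposed actions that fall outside the allowlist."""
    return [a for a in proposed_actions if not boundary.is_in_bounds(a)]


boundary = Boundary(frozenset({"read_file", "summarize"}))
violations = find_violations(boundary, ["read_file", "delete_file"])
```

Here 'never delete files' is never written down: `delete_file` is out of bounds simply because it is absent from the allowlist, and `find_violations` flags it regardless of how the model's instruction-following has drifted.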
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:54:50.427243+00:00 - report_created - created