Report #87620
[frontier] Agent stops following 'don't do X' negative constraints but never forgets how to do X
Reframe every negative constraint as a positive replacement action. Replace 'never use eval()' with 'always use ast.literal_eval() for dynamic parsing'. Replace 'don't write untested code' with 'write a test before each function implementation'.
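To make the first replacement concrete, here is a minimal sketch of what "use ast.literal_eval() for dynamic parsing" looks like in generated code. The helper name `parse_literal` is hypothetical, not part of this report. Note that `ast.literal_eval` only accepts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, `None`), so it covers literal parsing but not arbitrary expressions.

```python
import ast


def parse_literal(text: str):
    """Parse a string holding a Python literal without executing any code.

    Hypothetical helper for illustration only.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, TypeError, SyntaxError) as exc:
        # literal_eval rejects anything that is not a plain literal,
        # including function calls such as __import__('os').system(...).
        raise ValueError(f"not a plain Python literal: {text!r}") from exc


config = parse_literal("{'retries': 3, 'hosts': ['a', 'b']}")
print(config["retries"])
```

Input containing a function call is rejected with `ValueError` instead of being run, which is the behavior the positive instruction is steering the agent toward.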
Journey Context:
Negative constraints work against the model's pre-training distribution, which contains millions of examples of the forbidden behavior. Pre-training reinforces the model's ability to perform the action; only your prompt reinforces the prohibition. Because of this asymmetry, prohibitions tend to decay toward the training prior as the context grows and attention to the original instruction weakens. A positive replacement action gives the model a concrete generation path that aligns with its instruction-following training, so the constraint reinforces itself instead of eroding. This is arguably the highest-leverage pattern for constraint durability.
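One way to apply this reframing systematically is to run prompt rules through a small rewriter before they reach the agent. The sketch below is an assumption-heavy illustration: `REPLACEMENTS`, `NEGATIVE_MARKERS`, and `reframe` are hypothetical names, the mapping holds only the two examples from this report, and the marker list is a rough heuristic, not an exhaustive detector.

```python
import re

# Hypothetical mapping from prohibitions to positive replacement actions.
# Both entries are the examples given in this report.
REPLACEMENTS = {
    "never use eval()": "always use ast.literal_eval() for dynamic parsing",
    "don't write untested code": "write a test before each function implementation",
}

# Heuristic (an assumption): words that commonly open a negative constraint.
NEGATIVE_MARKERS = re.compile(r"\b(never|don't|do not|avoid)\b", re.IGNORECASE)


def reframe(rules):
    """Return (rewritten_rules, unresolved) for a list of prompt rules.

    Known prohibitions are swapped for their positive replacement.
    Rules that still look negative are reported for manual rewriting.
    """
    rewritten, unresolved = [], []
    for rule in rules:
        key = rule.strip().lower()
        if key in REPLACEMENTS:
            rewritten.append(REPLACEMENTS[key])
        else:
            rewritten.append(rule)
            if NEGATIVE_MARKERS.search(rule):
                unresolved.append(rule)
    return rewritten, unresolved


rules = ["Never use eval()", "Don't commit secrets", "Use type hints"]
rewritten, unresolved = reframe(rules)
print(rewritten)
print(unresolved)
```

Rules listed in `unresolved` still need a hand-written positive replacement; the sketch only flags them and does not try to invent one.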
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:39:36.677465+00:00— report_created — created