Report #92684
[frontier] Agent gradually violates 'never do X' constraints but follows 'always do Y' instructions reliably
Audit all system prompt constraints and rewrite every negative prohibition as a positive action. Replace 'Never skip error handling' with 'Always add error handling as the final step of every function.' Replace 'Don't use deprecated APIs' with 'Use only APIs from the current version documentation.' Replace 'Never output raw SQL' with 'Always parameterize database queries as the first step.'
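A minimal sketch of the audit step, assuming the system prompt stores one constraint per line. The names `NEGATIVE_PATTERN`, `REWRITES`, and `audit_constraints` are hypothetical, and the keyword list is a rough heuristic rather than a complete detector. The rewrite table holds only the three examples from this report. Anything else is flagged with no suggestion, so a person can write the positive form.

```python
import re
from typing import List, Optional, Tuple

# Heuristic (assumption): words that usually open a negative constraint.
NEGATIVE_PATTERN = re.compile(
    r"\b(never|don['’]t|do not|must not|avoid)\b", re.IGNORECASE
)

# The three rewrites given in this report; extend by hand for your own prompt.
REWRITES = {
    "Never skip error handling":
        "Always add error handling as the final step of every function.",
    "Don't use deprecated APIs":
        "Use only APIs from the current version documentation.",
    "Never output raw SQL":
        "Always parameterize database queries as the first step.",
}


def audit_constraints(prompt: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line_number, line, suggested_rewrite) for each negative constraint.

    suggested_rewrite is None when the line is not in the REWRITES table.
    """
    findings = []
    for number, line in enumerate(prompt.splitlines(), start=1):
        text = line.strip()
        if not text or not NEGATIVE_PATTERN.search(text):
            continue
        suggestion = next(
            (positive for negative, positive in REWRITES.items()
             if negative.lower() in text.lower()),
            None,
        )
        findings.append((number, text, suggestion))
    return findings


if __name__ == "__main__":
    sample = "Never skip error handling.\nAlways log requests.\nDo not share secrets."
    for number, text, suggestion in audit_constraints(sample):
        print(number, text, "->", suggestion or "needs a manual rewrite")
```

The script only flags candidates. Each suggested rewrite still needs a human check that the positive form covers the same behavior as the original prohibition.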
Journey Context:
Negative constraints define behavior by absence: they tell the model what NOT to do, which requires maintaining an active inhibition signal in attention. Over long sessions, this inhibition decays because the model has no positive activation signal to reinforce it, and the model's training distribution may not represent the prohibited behavior as unusual. Positive constraints create an active execution signal that is reinforced each time it fires. This pattern, converting prohibitions to prescriptions, is one of the highest-leverage changes production teams make when debugging instruction drift. It is counterintuitive because humans naturally think in prohibitions ('don't do the bad thing'), but models attend to what is present, not what is absent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:09:31.062569+00:00— report_created — created