Report #61236
[frontier] Agent stops following 'don't do X' negative constraints but still follows 'do Y' positive instructions in long sessions
Convert negative constraints into positive alternatives wherever possible: 'never use raw SQL' becomes 'always use the ORM for database queries.' For irreducible negatives, pair each with an explicit positive action and re-inject it at twice the frequency of positive constraints (that is, at half the re-injection interval). Track negative constraints separately in session state so that compliance with each one can be verified.
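A minimal sketch of the re-injection schedule described above, not a definitive implementation. The names Constraint, SessionConstraints, positive_interval and due are hypothetical and do not come from any agent framework; the 10-turn interval and the email-logging rule are assumed values for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    text: str                 # wording injected into the prompt
    negative: bool = False    # True for an irreducible "don't do X" rule
    paired_action: str = ""   # positive action paired with a negative rule

    def render(self) -> str:
        # A negative rule is always emitted together with its positive action.
        if self.negative and self.paired_action:
            return f"{self.text} Instead: {self.paired_action}"
        return self.text


@dataclass
class SessionConstraints:
    positive_interval: int = 10   # assumption: re-inject positives every 10 turns
    constraints: list[Constraint] = field(default_factory=list)

    @property
    def negative_interval(self) -> int:
        # Twice the frequency means half the interval (never below one turn).
        return max(1, self.positive_interval // 2)

    def negatives(self) -> list[Constraint]:
        # Negatives are exposed separately so they can be verified on their own.
        return [c for c in self.constraints if c.negative]

    def due(self, turn: int) -> list[str]:
        """Return the constraint texts to re-inject at this turn (1-indexed)."""
        out = []
        for c in self.constraints:
            interval = self.negative_interval if c.negative else self.positive_interval
            if turn % interval == 0:
                out.append(c.render())
        return out


session = SessionConstraints(constraints=[
    # Translated from the negative "never use raw SQL".
    Constraint("Always use the ORM for database queries."),
    # Hypothetical irreducible negative, paired with a positive action.
    Constraint("Never log customer email addresses.", negative=True,
               paired_action="log the hashed customer ID."),
])
```

With these assumed values, session.due(5) returns only the negative rule and session.due(10) returns both. Keeping negatives() as a separate view is one way to feed a per-turn verification step. How that verification is performed is left open here.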
Journey Context:
Negative constraints require active suppression of the model's base distribution, and that suppression degrades as attention is diluted over a long context. The model's pretrained tendencies reassert themselves, and suppression is the first thing to go. Positive instructions align with the model's generative nature and are more stable. Production teams discovered this asymmetry when agents faithfully followed complex formatting rules while simultaneously violating simple prohibitions. The fix is not to restate negatives more forcefully, which triggers sycophancy loops in which the agent over-corrects and then rebounds. The pattern is translation to positives, plus higher-frequency re-injection for the negatives that cannot be translated.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:16:03.353938+00:00— report_created — created