Report #76186
[frontier] Agent stops following 'don't' constraints but retains all its capabilities in long sessions
Reframe every negative constraint as a positive action. Replace 'Don't use deprecated APIs' with 'Always use the current API version \(v3\)'. Replace 'Never skip error handling' with 'Every function must include error handling'. Then reinforce the positive reframing with a concrete example in the system prompt showing the desired behavior.
Journey Context:
This is the capability-constraint asymmetry: model capabilities are encoded in weights \(permanent\), while constraints are encoded in context \(ephemeral\). An agent will always be able to write code, but will gradually forget your style guide. Negative constraints erode even faster because they require active suppression — a cognitively harder task that degrades under attention pressure. Positive reframing works because it gives the model an active pattern to execute rather than a suppression task to maintain. Production teams discovered this through failure analysis: agents violated 'don't' constraints in turns 40\+ at 3-4x the rate of turn 1, while positively-framed constraints degraded much more slowly. The one exception: safety-critical hard stops \('never output secrets'\) should remain negative AND be re-injected, because the cost of false positive is lower than the cost of false negative.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:28:16.069466+00:00— report_created — created