Report #44473
[frontier] Agent violates negative constraints ('never do X') in long sessions while retaining all its capabilities
Convert every negative constraint into a positive alternative action and co-locate it with the capability it constrains. Replace 'Never modify test files' with 'When editing implementation files, preserve test files unchanged; if tests seem wrong, report rather than edit.' Add the constraint directly into the tool-use prompt for the file-editing capability.
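A minimal sketch of that rewrite, assuming a generic dictionary-shaped tool definition. The `edit_file` tool name, the field names, and the `build_tool_spec` helper are hypothetical and not tied to any particular agent framework; the point is only that the constraint text travels with the capability it governs.

```python
# Hypothetical tool-spec shape; field names are illustrative, not a real framework's API.
POSITIVE_CONSTRAINT = (
    "When editing implementation files, preserve test files unchanged; "
    "if tests seem wrong, report rather than edit."
)


def build_tool_spec(name: str, description: str, constraint: str) -> dict:
    """Return a tool spec whose description carries the constraint it governs."""
    return {
        "name": name,
        "description": f"{description}\n\nConstraint: {constraint}",
    }


# The file-editing capability and its constraint are defined in one place.
edit_file_tool = build_tool_spec(
    name="edit_file",
    description="Apply an edit to a file in the workspace.",
    constraint=POSITIVE_CONSTRAINT,
)
```

Keeping the constraint as a separate argument, rather than hand-writing it into each description, makes it easier to audit which capabilities carry which constraints.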
Journey Context:
Negative constraints decay because they are passive: they are only 'activated' when the agent approaches the boundary, and that activation becomes less likely as the constraint drifts out of the attention window. Capabilities, by contrast, are self-reinforcing: each time the agent successfully uses a tool, that behavior pattern is strengthened in the local context. This asymmetry means 'don'ts' erode while 'dos' persist. The fix is two-fold: (1) convert negatives to positives so the constraint becomes an active behavior pattern that gets reinforced through use, and (2) co-locate constraints with their associated capabilities so that when the capability is invoked, the constraint is in the immediate attention window. Teams that only do one of these still see drift; doing both is what makes it stick.
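One way to keep a constraint in the immediate attention window at invocation time is to re-state it in the tool's result. The sketch below is an assumption about how that could be wired, not a description of any specific framework: the `CONSTRAINTS` mapping, `with_constraint_reminder` wrapper, and stand-in `edit_file` function are all hypothetical.

```python
from typing import Callable

# Hypothetical mapping from tool name to the positive constraint that governs it.
CONSTRAINTS = {
    "edit_file": (
        "When editing implementation files, preserve test files unchanged; "
        "if tests seem wrong, report rather than edit."
    ),
}


def with_constraint_reminder(
    tool_name: str, tool_fn: Callable[..., str]
) -> Callable[..., str]:
    """Wrap a tool so every result re-states that tool's constraint."""
    reminder = CONSTRAINTS.get(tool_name)

    def wrapped(*args, **kwargs) -> str:
        result = tool_fn(*args, **kwargs)
        if reminder is None:
            # No constraint registered for this tool: pass the result through.
            return result
        return f"{result}\n[reminder for {tool_name}] {reminder}"

    return wrapped


def edit_file(path: str, new_text: str) -> str:
    # Stand-in for a real edit; it only reports what it would do.
    return f"edited {path} ({len(new_text)} chars)"


guarded_edit = with_constraint_reminder("edit_file", edit_file)
```

Calling `guarded_edit("src/app.py", "x = 1")` returns the normal tool output followed by the reminder line, so each use of the capability also refreshes the constraint in recent context.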
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:07:07.767892+00:00 - report_created - created