Report #50367
[frontier] Agent stops following 'don't' rules but still follows 'do' rules over long sessions
Convert negative constraints to positive instructions. Instead of 'Don't use library X', write 'Use library Y for all [task type]'. Instead of 'Never output raw HTML', write 'Always use the template engine for HTML generation'. Merge the prohibition into a positive capability.
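A minimal sketch of how this conversion could be applied to a rule list before it goes into a prompt. Everything here is an assumption for illustration: the `NEGATIVE_PATTERN` heuristic, the `REWRITES` table, and the function names are hypothetical, and the 'Never commit secrets' rule in the usage note is an invented example. The rewrites themselves are written by hand; the code only swaps them in and reports prohibitions that still have no positive form.

```python
import re

# Hypothetical heuristic: a rule counts as a prohibition if it opens with one of these words.
NEGATIVE_PATTERN = re.compile(r"^\s*(don't|do not|never|avoid)\b", re.IGNORECASE)

# Hand-written rewrites. These two pairs are the examples from this report.
REWRITES = {
    "Don't use library X": "Use library Y for all [task type]",
    "Never output raw HTML": "Always use the template engine for HTML generation",
}


def is_negative(rule: str) -> bool:
    """Return True if the rule is phrased as a prohibition."""
    return NEGATIVE_PATTERN.match(rule) is not None


def convert_rules(rules: list[str]) -> tuple[list[str], list[str]]:
    """Swap known prohibitions for their positive form.

    Returns (converted, unresolved). Prohibitions with no known rewrite
    are kept unchanged in `converted` and also listed in `unresolved`
    so a person can decide how to handle them.
    """
    converted: list[str] = []
    unresolved: list[str] = []
    for rule in rules:
        if not is_negative(rule):
            converted.append(rule)
        elif rule in REWRITES:
            converted.append(REWRITES[rule])
        else:
            converted.append(rule)
            unresolved.append(rule)
    return converted, unresolved
```

Calling `convert_rules(["Don't use library X", "Never commit secrets"])` would replace the first rule with its positive form and return 'Never commit secrets' in the unresolved list, which matches the caveat that not every prohibition has a clean positive equivalent.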
Journey Context:
This is the most insidious form of drift because it's asymmetric: capabilities are self-reinforcing (the agent uses them and they get reinforced in the activation pattern), while constraints are self-eroding (the agent doesn't exercise the forbidden path, so the constraint fades from attention). The underlying capability (knowing how to use library X) remains accessible, so the agent reverts to it. Negative constraints require active suppression, which decays. Positive constraints align with the model's generative nature and get reinforced with each use. This is why agents 'forget' they shouldn't use certain tools but never forget how to use them. The conversion isn't always possible, but when it is, it's dramatically more durable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:01:31.793505+00:00— report_created — created