Report #78572
[frontier] Agent forgets 'never do X' constraints faster than 'always do Y' instructions over long sessions
Convert negative constraints into positive alternatives wherever possible. For every 'don't do X', specify 'instead, always do Y'. For constraints that must remain negative, pair them with explicit reasoning: 'Never do X because [concrete reason]; always do Y instead.' This gives the model a compliant path that aligns with its trained capabilities.
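The conversion rule above can be sketched as a small prompt-building helper. This is a minimal illustration, not an established API: the `Constraint` structure, its field names, and the `render` function are hypothetical, and the output wording simply mirrors the template given in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    """One behavioural rule for an agent prompt (hypothetical structure)."""
    avoid: str                     # behaviour to suppress, e.g. "log API keys"
    instead: str                   # the compliant alternative the agent should take
    reason: Optional[str] = None   # concrete reason, required if the negative stays
    keep_negative: bool = False    # True when the "never" wording must be retained


def render(c: Constraint) -> str:
    """Render a constraint as a single prompt line."""
    if not c.keep_negative:
        # Preferred form: state only the positive alternative.
        return f"Always {c.instead}."
    if c.reason is None:
        # A retained negative must carry its reasoning.
        raise ValueError("a retained negative constraint needs a concrete reason")
    return f"Never {c.avoid} because {c.reason}; always {c.instead} instead."


rules = [
    Constraint(avoid="commit directly to main",
               instead="open a pull request from a feature branch"),
    Constraint(avoid="log API keys",
               instead="log only the last four characters of the key",
               reason="logs are shared outside the team",
               keep_negative=True),
]
system_prompt_rules = "\n".join(render(r) for r in rules)
```

Raising on a missing reason is a design choice for this sketch: it makes it impossible to emit a bare 'never do X' line by accident.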
Journey Context:
Negative constraints require the model to actively suppress behavior it already knows how to execute. That suppression competes with what the model learned in training and degrades as attention spreads across a long context. Positive instructions align with the model's capabilities and are reinforced each time the model executes them successfully, which creates a virtuous cycle. The "Lost in the Middle" research shows that models use information in the middle of a long context less reliably than information at the start or end; this is consistent with, though it does not directly test, the observation that instructions conflicting with learned behavior fade fastest. The fix is not to eliminate negative constraints entirely, since some are necessary, but to ensure every negative constraint has a positive alternative that gives the model a 'path of least resistance' aligned with your intent. Without this, the attention given to the negative constraint eventually becomes too weak to override the trained behavior.
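One way to check an existing prompt against this rule is a simple audit that flags negative constraints with no positive alternative on the same line. The sketch below is a rough heuristic under stated assumptions: the function name `unpaired_negatives` and the keyword lists are illustrative choices, and keyword matching will miss or misflag phrasing it does not anticipate.

```python
import re
from typing import List

# Assumed markers of a negative constraint and of a paired positive alternative.
NEGATIVE = re.compile(r"\b(never|do not|don't|must not|avoid)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(always|instead)\b", re.IGNORECASE)


def unpaired_negatives(prompt: str) -> List[str]:
    """Return prompt lines that forbid something without naming an alternative."""
    flagged = []
    for line in prompt.splitlines():
        text = line.strip()
        # Flag only lines that contain a negative marker and no positive one.
        if text and NEGATIVE.search(text) and not POSITIVE.search(text):
            flagged.append(text)
    return flagged


example_prompt = (
    "Never force-push.\n"
    "Never log secrets because logs are shared; always mask them instead.\n"
    "Always run the test suite before committing."
)
needs_rewrite = unpaired_negatives(example_prompt)
```

Here only the first line of `example_prompt` is flagged, since the second pairs its prohibition with an alternative and the third is already positive.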
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:28:56.159743+00:00— report_created — created