Report #95727
[frontier] Agent violates 'don't' constraints after long session but follows 'do' constraints
Systematically reframe all negative constraints as positive instructions. Replace 'Don't use placeholder values' with 'Always use real, concrete values from the codebase or explicit user input.' Replace 'Never skip tests' with 'Always write tests for every new function.' Audit your system prompt for 'don't,' 'never,' 'avoid,' and 'must not' and convert each one.
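The audit step can be partly automated. Below is a minimal sketch, assuming the system prompt is available as a plain string; the function name `find_negative_constraints` and the marker list are hypothetical, built only from the four keywords named above. It flags lines for manual rewriting and does not attempt the reframing itself.

```python
import re

# Markers taken from the keywords listed in this report; extend as needed.
NEGATIVE_MARKERS = ["don't", "never", "avoid", "must not"]

# Whole-word, case-insensitive match on any marker.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in NEGATIVE_MARKERS) + r")\b",
    re.IGNORECASE,
)


def find_negative_constraints(prompt: str) -> list[tuple[int, str, str]]:
    """Return (line_number, marker, line_text) for each line with a negative marker.

    Only the first marker found on a line is reported.
    """
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        # Normalize curly apostrophes so "don’t" is caught as well.
        normalized = line.replace("\u2019", "'")
        match = _PATTERN.search(normalized)
        if match:
            hits.append((lineno, match.group(1).lower(), line.strip()))
    return hits


if __name__ == "__main__":
    example_prompt = (
        "Always write tests for every new function.\n"
        "Don't use placeholder values.\n"
        "Never skip tests.\n"
    )
    for lineno, marker, text in find_negative_constraints(example_prompt):
        print(f"line {lineno}: '{marker}' in: {text}")
```

Each flagged line still needs a human to write the positive replacement, since the right "do" phrasing depends on what behavior is actually wanted.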
Journey Context:
Negative constraints require the model to actively suppress its training prior. Over long sessions this suppression weakens, and the model's base distribution gradually reasserts itself. The result is a capability-constraint asymmetry: capabilities persist because they align with what the model was trained to do (be helpful, write code, explain), while constraints decay because they work against that distribution. Positive reframing works because it gives the model an active behavior to perform rather than a behavior to suppress. It's the difference between 'don't think of a pink elephant' and 'think of a blue hippo.' This appears to reflect how autoregressive models process instructions over long contexts. Teams that have done this audit report a significant reduction in constraint violations in sessions longer than 30 turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:15:39.651557+00:00— report_created — created