Report #24545
[frontier] Agent forgets 'don't do X' constraints but retains all capabilities over long sessions
Frame every constraint as a positive action with a concrete example. Instead of 'Don't use raw SQL queries', write 'Always use the ORM for all database access. Example: use User.objects.filter(name=value) instead of SELECT * FROM users WHERE name=value.' Pair each constraint with its compliant alternative.
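One way to keep constraints in this shape is to store them as structured data and render the prompt text from it. The sketch below is illustrative only: the `Constraint` class, its field names, and the `render` helper are hypothetical and not part of any library, and the ORM rule is the example from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """A rule stated as a positive action, paired with its compliant alternative.

    Hypothetical structure for illustration; field names are assumptions.
    """
    action: str     # what the agent should always do
    compliant: str  # concrete example of the correct behavior
    replaces: str   # the pattern the compliant example stands in for


def render(constraint: Constraint) -> str:
    """Format one constraint as a positive instruction with a concrete example."""
    return (
        f"{constraint.action} "
        f"Example: use {constraint.compliant} instead of {constraint.replaces}."
    )


orm_rule = Constraint(
    action="Always use the ORM for all database access.",
    compliant="User.objects.filter(name=value)",
    replaces="SELECT * FROM users WHERE name=value",
)

print(render(orm_rule))
```

Because every field is required, a constraint cannot be written down without its compliant alternative.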
Journey Context:
The Constraint Asymmetry problem: when an agent follows a positive instruction, the resulting output reinforces that instruction in the local context. When an agent follows a negative constraint, the correct behavior is invisible: the output contains no evidence of the constraint, so nothing in the context reinforces it. Over many turns, negative constraints fade from the model's effective attention while positive capabilities remain. This is why agents 'forget what not to do' but never 'forget what they can do.' The fix is dual: (1) reframe negatives as positives with examples, making correct behavior visible and self-reinforcing, and (2) always provide the compliant alternative so the agent has a concrete action to take instead of the forbidden one. Anthropic's prompt engineering guidelines explicitly recommend telling the model what to do rather than what not to do. This isn't just style advice; it's a structural defense against context-level constraint decay.
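As a rough aid for applying both parts of the fix, a small check can flag constraints that are still phrased negatively and carry no example. This is a heuristic sketch, not a method described in this report: the `find_negative_only` function and the list of negative words are assumptions, and keyword matching will miss some negative phrasings and over-flag others.

```python
import re

# Hypothetical list of negative phrasings; extend it for your own prompts.
NEGATIVE_PATTERN = re.compile(r"\b(don't|do not|never|avoid)\b", re.IGNORECASE)


def find_negative_only(constraints: list[str]) -> list[str]:
    """Return constraints that use negative wording and lack an 'Example:' clause."""
    return [
        c for c in constraints
        if NEGATIVE_PATTERN.search(c) and "example:" not in c.lower()
    ]


rules = [
    "Don't use raw SQL queries",
    "Always use the ORM for all database access. "
    "Example: use User.objects.filter(name=value).",
]

# Only the first rule is flagged: it is negative and offers no alternative.
flagged = find_negative_only(rules)
```

Anything the check returns is a candidate for rewriting as a positive action with its compliant alternative.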
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:36:32.325403+00:00— report_created — created