Report #61054
[frontier] Agent ignores 'never do X' negative constraints after many turns but retains positive capabilities
Rewrite all negative constraints as positive identity statements. 'Never use var' becomes 'You write modern JS using const/let'. 'Don't be verbose' becomes 'You are concise, giving minimal complete answers'. For constraints that resist reframing, pair the negative with a positive alternative in the same sentence.
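A minimal Python sketch of this workaround, assuming the constraint list is assembled programmatically into a system prompt. `Constraint` and `render_system_prompt` are hypothetical names, not part of any library. The first two rules are the rewrites given above. The secrets rule is an illustrative stand-in for a constraint that resists reframing, so it is kept negative and paired with a positive alternative in the same sentence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    """One rule, stored as a positive identity statement where possible."""

    text: str                      # identity statement, or a negative that resists reframing
    instead: Optional[str] = None  # positive alternative paired with a negative `text`

    def render(self) -> str:
        if self.instead is None:
            return self.text
        # Keep the negative and its alternative in the same sentence.
        return f"{self.text.rstrip('.')}; instead, {self.instead}"


def render_system_prompt(role: str, constraints: list[Constraint]) -> str:
    """Assemble an identity document: a role line, then one line per constraint."""
    lines = [f"You are {role}."]
    lines.extend(c.render() for c in constraints)
    return "\n".join(lines)


# Hypothetical example using the rewrites from the workaround above.
prompt = render_system_prompt(
    "a senior engineer who writes production JavaScript",
    [
        Constraint("You write modern JS using const/let."),
        Constraint("You are concise, giving minimal complete answers."),
        # Illustrative security rule that stays negative, so it gets a paired alternative.
        Constraint(
            "You never paste secret values into chat.",
            instead="you refer to each secret by its variable name.",
        ),
    ],
)
print(prompt)
```

Keeping `instead` as its own field makes the pairing explicit in the data, so a negative rule without an alternative is easy to spot in review.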
Journey Context:
LLMs process negation by activating the negated concept and then attempting to suppress it, which is a weaker cognitive path than direct activation. Over long sessions, 'don't do X' decays toward 'do X' because the suppression signal attenuates while the concept activation persists. Capabilities stick because they are positive demonstrations. The frontier insight: your constraint list is an identity document, not a rulebook. Agents whose prompts open with 'You are a senior engineer who...' hold constraints better than agents whose prompts open with 'You must not...'. Tradeoff: some security constraints are genuinely negative and resist easy reframing. Pair those with explicit positive alternatives so the model has somewhere to go.
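To audit an existing prompt for prohibitions that give the model nowhere to go, a rough keyword heuristic can flag lines that open with a negation and carry no alternative. This is a hypothetical sketch: `negative_only` is an illustrative name, and both word lists are assumptions that will miss prohibitions phrased any other way.

```python
import re

# Assumed word lists: a rule "opens with a negation" if it starts with one of these,
# optionally preceded by "You".
NEGATIVE_OPENER = re.compile(
    r"^\s*(?:you\s+)?(?:never|don't|do not|must not|should not|avoid)\b",
    re.IGNORECASE,
)
# Assumed markers that signal a paired positive alternative.
ALTERNATIVE_MARKER = re.compile(r"\b(?:instead|rather)\b", re.IGNORECASE)


def negative_only(constraints: list[str]) -> list[str]:
    """Return the rules phrased purely as prohibitions, with no positive alternative."""
    return [
        c
        for c in constraints
        if NEGATIVE_OPENER.search(c) and not ALTERNATIVE_MARKER.search(c)
    ]


rules = [
    "Never use var.",
    "Don't be verbose.",
    "You write modern JS using const/let.",
    "You must not paste secret values into chat; instead, refer to each secret by its variable name.",
]
print(negative_only(rules))
```

The first two rules are flagged as candidates for rewriting into identity statements; the paired security rule passes because it names an alternative.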
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:57:56.194124+00:00 - report_created - created