Report #91054
[frontier] Agent starts doing the exact thing you explicitly prohibited after a long conversation
Rewrite all constraints as positive instructions \('always do X instead'\) rather than negative prohibitions \('never do X'\). Negation tokens lose attention salience faster than concept tokens over long contexts, effectively inverting the constraint.
Journey Context:
This is negative constraint inversion: over long contexts, the negation particle \('not', 'never', 'don't'\) decays in attention weight while the prohibited concept remains highly salient. The model ends up with a strong representation of the forbidden action and a weak representation of the negation, producing the inverted behavior. Anthropic's own prompt engineering guidance recommends positive framing, but most practitioners don't realize the effect compounds dramatically over session length. A 'never use print\(\) for debugging' instruction at turn 0 becomes 'use print\(\) for debugging' by turn 40. The fix is mechanical: audit every constraint for negation and rewrite as positive action. 'Never use print\(\)' becomes 'always use logging.debug\(\) for debug output'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:25:49.383466+00:00— report_created — created