Report #73681
[counterintuitive] Why does the model do the exact thing I told it NOT to do when I use negative constraints?
State what you WANT, not what you don't want. Replace negative constraints ('don't use X', 'never do Y') with positive specifications ('use Z instead', 'always do W'). If you must use negation, place it prominently, reinforce it with examples of correct behavior, and verify the output against the constraint externally.
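A minimal sketch of that rewrite in Python, for illustration only. The helper `build_prompt`, the `POSITIVE_REWRITES` table, and the constraint wording are all hypothetical and not part of any library; no model API is called, the code only assembles the prompt text.

```python
# Hypothetical mapping from negative constraints to positive specifications.
# The wording is illustrative, not taken from any library or benchmark.
POSITIVE_REWRITES = {
    "don't use list comprehensions": "build lists with explicit for loops and list.append()",
    "don't include imports": "return only the function definition, starting at the 'def' line",
}


def build_prompt(task: str, constraints: list[str], good_example: str = "") -> str:
    """Assemble a prompt that states desired behavior and shows a correct example."""
    lines = [task, "", "Requirements:"]
    for constraint in constraints:
        # Fall back to the original wording when no rewrite is known.
        lines.append("- " + POSITIVE_REWRITES.get(constraint.lower(), constraint))
    if good_example:
        lines += ["", "Example of the expected style:", good_example]
    return "\n".join(lines)


prompt = build_prompt(
    "Write a function that squares each number in a list.",
    ["Don't use list comprehensions"],
    good_example="result = []\nfor n in numbers:\n    result.append(n * n)",
)
print(prompt)
```

The resulting prompt never names the unwanted construct; it describes the wanted one and shows an instance of it.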
Journey Context:
Developers write constraints like 'don't include imports' or 'never use list comprehensions' and are frustrated when the model does exactly that. This happens because mentioning a concept activates its representation in the model: saying 'don't use list comprehensions' primes the model to think about list comprehensions. The model processes the negation modifier weakly compared to the strongly activated concept it modifies. This appears to follow from how attention and representation work in transformers: mentioning X activates the representation of X whether or not it is negated, and the negation acts as a weak, secondary signal that the primary concept activation can overwhelm. Research on negation in language models has found that they often fail to suppress the negated concept. The model is not being defiant; negation is simply a fragile operation in its processing.
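Because the negation may not hold, the two example constraints can be checked outside the model. The following is a rough sketch using Python's standard `ast` module; the function name `find_violations` and the sample `generated` string are hypothetical, and the check assumes the model output is syntactically valid Python (otherwise `ast.parse` raises `SyntaxError`).

```python
import ast


def find_violations(source: str) -> list[str]:
    """Return a description of each import or list comprehension found in source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append(f"line {node.lineno}: import statement")
        elif isinstance(node, ast.ListComp):
            violations.append(f"line {node.lineno}: list comprehension")
    return violations


# Stand-in for model output; a real pipeline would pass the generated code here.
generated = "import math\nsquares = [n * n for n in range(5)]\n"
for problem in find_violations(generated):
    print(problem)
```

A non-empty result can be used to reject the output or to retry with a revised prompt.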
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:16:17.239544+00:00— report_created — created