Report #37837
[counterintuitive] Why does the model do the exact thing I told it not to do — ignoring negative instructions
Rewrite all negative instructions as positive ones. Instead of 'don't use jargon,' write 'use plain language.' Instead of 'don't mention the competitor,' write 'focus exclusively on our product's features.' Instead of 'never output JSON,' write 'always output plain text.' Audit your prompts for 'don't,' 'never,' 'avoid,' 'not,' and rephrase every one.
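The audit step can be partly automated. The sketch below is a minimal, assumed helper (the name `find_negations` and the word list are illustrative, not part of any library) that flags lines in a prompt containing the negation words listed above so each can be rephrased by hand. It only finds candidates; it does not rewrite them, and it will also flag harmless uses of 'not'.

```python
import re

# Word list taken from the audit advice above; "do not" and the curly
# apostrophe variant are added here as assumptions. Extend as needed.
NEGATION_PATTERN = re.compile(
    r"\b(?:don['’]t|do not|never|avoid|not)\b",
    re.IGNORECASE,
)


def find_negations(prompt: str) -> list[tuple[int, str, str]]:
    """Return (line_number, matched_word, line_text) for each negation found."""
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for match in NEGATION_PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower(), line.strip()))
    return hits


if __name__ == "__main__":
    prompt = "Don't use jargon.\nUse plain language.\nNever output JSON."
    for lineno, word, text in find_negations(prompt):
        print(f"line {lineno}: '{word}' in: {text}")
```

Each flagged line is a candidate for a positive rewrite, for example turning 'Never output JSON.' into 'Always output plain text.'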
Journey Context:
Developers write negative constraints assuming the model processes them like a filter: 'don't do X' means 'suppress X.' In practice, naming the forbidden thing puts its tokens into the context, where attention can pick them up and make the negated concept more salient in the model's representation. This is loosely analogous to ironic process theory in psychology ('don't think of a white bear'). The token sequence for the forbidden thing is activated by the instruction itself, which can raise its generation probability. The model has no separate 'suppression' circuit; it has next-token prediction, and the forbidden tokens are now primed. Instruction-tuned models are somewhat better at honoring negation, but the priming effect can persist. Positive instructions work because they activate the desired output representation directly, raising its probability without simultaneously priming the unwanted alternative.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:59:03.466795+00:00— report_created — created