Report #91646
[counterintuitive] Can I reliably instruct the model to NOT do something using negative constraints?
Rewrite all constraints as positive instructions: 'don't use bullet points' → 'write in paragraph form'; 'never mention X' → 'focus on Y and Z'. If negation is unavoidable, place it at both the beginning and end of the prompt and validate outputs programmatically.
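A minimal sketch of the workaround above, not a definitive implementation. `POSITIVE_REWRITES` and `build_prompt` are hypothetical names introduced for illustration; the two rewrite pairs are the ones quoted in this report, and the `Constraints:` / `Style:` / `Reminder:` labels are an assumed prompt layout. Any constraint without a known rewrite is treated as an unavoidable negation and repeated at the start and end of the prompt.

```python
# Hypothetical rewrite table; both pairs are taken from the report text.
POSITIVE_REWRITES = {
    "don't use bullet points": "write in paragraph form",
    "never mention X": "focus on Y and Z",
}


def build_prompt(task: str, constraints: list[str]) -> str:
    """Assemble a prompt that prefers positive phrasing.

    Constraints with a known rewrite are replaced by the positive form.
    Everything else is kept as a negation and placed both first and last.
    """
    positive = []
    negative = []
    for constraint in constraints:
        if constraint in POSITIVE_REWRITES:
            positive.append(POSITIVE_REWRITES[constraint])
        else:
            negative.append(constraint)

    parts = []
    if negative:
        # Unavoidable negations open the prompt...
        parts.append("Constraints: " + "; ".join(negative) + ".")
    parts.append(task)
    if positive:
        parts.append("Style: " + "; ".join(positive) + ".")
    if negative:
        # ...and are repeated at the very end.
        parts.append("Reminder: " + "; ".join(negative) + ".")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Summarize the release notes.",
        ["don't use bullet points", "never exceed 100 words"],
    ))
```

The exact-match lookup is deliberately simple; a real prompt pipeline would likely need a manual review step for constraints that have no clean positive form.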
Journey Context:
Developers routinely write prompts with negative constraints: 'don't use lists', 'do not mention the competitor', 'never exceed 100 words'. The assumption is that the model processes negation like a hard filter. In reality, mentioning a concept in the prompt activates its representation in the model's hidden states: the phrase 'bullet points' strongly activates the bullet-point pattern regardless of the 'don't' in front of it, because negation modifiers are weakly represented relative to the concept they modify. This is the LLM equivalent of the 'pink elephant' problem, where telling someone not to think of pink elephants makes them think of pink elephants. The model is far better at following 'write in paragraph form' than 'don't use bullet points' because the positive instruction activates the desired output pattern directly. This is a structural property of how meaning is distributed in transformer activations, not a prompt-length issue. Negative constraints also interact badly with instruction tuning, which trains heavily on affirmative instructions.
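Because negative constraints cannot be assumed to hold, the three example constraints above (no lists, no competitor mention, 100-word cap) can be checked on the output side. The following is a rough sketch under stated assumptions: `validate` and `BULLET_RE` are hypothetical names, the list detection is a simple line-start heuristic, and banned terms are matched as case-insensitive substrings, which will miss paraphrases.

```python
import re

# Heuristic: a line that starts with -, *, or a bullet character,
# or with a number followed by . or ), then whitespace.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def validate(output: str, banned_terms: list[str], max_words: int) -> list[str]:
    """Return a list of human-readable violations; empty means the output passed."""
    violations = []
    if BULLET_RE.search(output):
        violations.append("contains list formatting")
    lowered = output.lower()
    for term in banned_terms:
        # Case-insensitive substring check (assumption: good enough for exact names).
        if term.lower() in lowered:
            violations.append(f"mentions banned term: {term}")
    # Word count by whitespace splitting.
    if len(output.split()) > max_words:
        violations.append(f"exceeds {max_words} words")
    return violations


if __name__ == "__main__":
    # "Acme" is a placeholder competitor name used only for this example.
    print(validate("- first point\n- second point", ["Acme"], 100))
```

A non-empty result would typically trigger a retry or a fallback rather than being shown to the user.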
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:25:07.598691+00:00— report_created — created