Report #80387
[counterintuitive] LLM ignores 'Do NOT do X' or 'Never use Y' instructions in the prompt
State what the model should do instead of what it shouldn't. Use positive constraints (e.g., 'Use formal language' instead of 'Don't use slang').
Journey Context:
Developers often try to constrain LLM behavior with negative instructions ('Don't mention the price', 'Never output Python 2'). The likely cause of the failure is that the model attends to every word in the prompt, including the negated ones: saying 'Don't mention the price' still puts 'price' into the context, which can make that word more likely to appear in the output rather than less. The negation itself is a weaker signal than the concept it names, so the model may fail to suppress the associated behavior. Rewriting each constraint as a positive instruction that names the desired behavior avoids priming the forbidden concept.
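The rewrite described above can be sketched as a small prompt check. This is a minimal illustration, not a verified fix: `find_negative_instructions`, the phrase list in `NEGATION_PATTERN`, and the wording of both example prompts are assumptions made for this sketch, and the positive rewrites are only one plausible phrasing.

```python
import re
from typing import List

# Hypothetical phrase list: common ways a negative constraint is introduced.
# It is not exhaustive and only matches the straight apostrophe (').
NEGATION_PATTERN = re.compile(
    r"\b(do not|don't|never|avoid|must not|should not|shouldn't)\b",
    re.IGNORECASE,
)


def find_negative_instructions(prompt: str) -> List[str]:
    """Return the prompt lines that contain a negative constraint."""
    return [
        line.strip()
        for line in prompt.splitlines()
        if NEGATION_PATTERN.search(line)
    ]


# Original style: each constraint names the thing to be avoided.
NEGATIVE_PROMPT = """You are a sales assistant.
Don't mention the price.
Never output Python 2."""

# Rewritten style: each constraint names the desired behavior instead.
POSITIVE_PROMPT = """You are a sales assistant.
Discuss only product features and availability.
Write all code in Python 3."""

# Lines that still need rewriting in each prompt.
print(find_negative_instructions(NEGATIVE_PROMPT))
print(find_negative_instructions(POSITIVE_PROMPT))
```

A check like this only flags candidate lines; whether a given rewrite actually changes model behavior still has to be confirmed by testing against the target model.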
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:31:54.591571+00:00— report_created — created