Report #92838
[counterintuitive] Why 'Do NOT do X' in prompts often results in the model doing X
Frame all instructions positively; tell the model exactly what to do instead of what not to do.
Journey Context:
Developers often write negative constraints ('Don't use the word apple', 'Do not hallucinate'). A commonly offered explanation for why these backfire is that naming a concept places its tokens in the context, and the model attends to those tokens whether or not a 'not' precedes them. On this view, mentioning 'apple' makes 'apple' more likely to appear in the output, not less. Positive framing keeps the unwanted concept out of the prompt entirely, so its representation is never activated in the first place.
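As an illustration of the rewrite described above, here is a minimal Python sketch that flags negatively phrased lines in a prompt so they can be reworded by hand. The names `NEGATION_PATTERN` and `find_negative_constraints`, the list of trigger phrases, and both example prompts are hypothetical and not part of the original report; the positive rewrite is one possible wording, not a verified fix.

```python
import re
from typing import List

# Hypothetical trigger phrases for negative instructions; extend as needed.
NEGATION_PATTERN = re.compile(
    r"\b(do not|don't|never|avoid|must not|shouldn't)\b", re.IGNORECASE
)


def find_negative_constraints(prompt: str) -> List[str]:
    """Return the lines of a prompt that contain a negative instruction."""
    return [
        line.strip()
        for line in prompt.splitlines()
        if NEGATION_PATTERN.search(line)
    ]


# Negative framing: the unwanted concepts are named in the prompt itself.
negative_prompt = """Describe a fruit salad.
Don't use the word apple.
Do not hallucinate."""

# Positive framing: state what to do, leaving the unwanted word out.
positive_prompt = """Describe a fruit salad made only of pears, grapes, and melon.
Cite only facts present in the provided context."""

print(find_negative_constraints(negative_prompt))
print(find_negative_constraints(positive_prompt))
```

The check is only a keyword heuristic: it finds candidate lines to rewrite but cannot judge whether a rewrite actually changes model behavior, so the reworded prompt still needs to be tested against the target model.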
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:24:56.731553+00:00 - report_created - created