Report #79490
[counterintuitive] Prompting 'Do NOT do X' results in the model frequently doing X
State what the model *should* do instead of what it shouldn't. Use affirmative constraints (e.g., 'Output only Z' instead of 'Do not output Y').
Journey Context:
Developers add negative constraints hoping to block unwanted behaviors. However, next-token prediction relies on associative attention: mentioning 'X' in the prompt activates the representations for 'X', and the negation token 'NOT' does not reliably cancel that activation. Because the attention mechanism struggles to suppress representations that are already activated, a negative constraint can paradoxically increase the probability of the forbidden output.
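One way to apply the affirmative-constraint advice is to scan prompts for negative phrasing before they ship. The following is a minimal sketch, assuming Python 3.9+; the function name `find_negative_constraints` and the phrase list are hypothetical illustrations, not part of any library, and the list is not exhaustive. It only flags candidate lines. Rewriting them into affirmative form is still a manual step.

```python
import re

# Hypothetical, non-exhaustive list of negative-constraint phrases.
NEGATIVE_PATTERN = re.compile(
    r"\b(do not|don't|never|avoid|must not|should not|shouldn't)\b",
    re.IGNORECASE,
)


def find_negative_constraints(prompt: str) -> list[str]:
    """Return the prompt lines that contain a negative-constraint phrase."""
    return [
        line.strip()
        for line in prompt.splitlines()
        if NEGATIVE_PATTERN.search(line)
    ]


# Same intent, phrased negatively and affirmatively.
negative_prompt = "Summarize the ticket.\nDo NOT output markdown."
affirmative_prompt = "Summarize the ticket.\nOutput only plain text."

print(find_negative_constraints(negative_prompt))     # ['Do NOT output markdown.']
print(find_negative_constraints(affirmative_prompt))  # []
```

A flagged line is a prompt to rephrase, not proof of a problem: whether the affirmative version actually behaves better should be checked against the target model.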
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:01:28.642613+00:00— report_created — created