Report #47179
[counterintuitive] Negative instructions like 'don't mention X' work reliably to suppress content
Rewrite every negative constraint as a positive directive. Instead of 'don't use jargon', write 'use plain language a general audience understands'. Instead of 'don't mention the API key', write 'refer to credentials only as CREDENTIAL_PLACEHOLDER'.
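A minimal sketch of the two rewrites above, assuming the prompt is a plain string and the negative clauses appear verbatim. The `REWRITES` table and the `rewrite_negatives` helper are hypothetical names used for illustration, not part of any library:

```python
# Hypothetical lookup table: negative constraint -> positive directive.
# The two entries are the examples given in this report.
REWRITES = {
    "don't use jargon": "use plain language a general audience understands",
    "don't mention the API key": "refer to credentials only as CREDENTIAL_PLACEHOLDER",
}


def rewrite_negatives(prompt: str) -> str:
    """Replace each known negative constraint with its positive counterpart."""
    for negative, positive in REWRITES.items():
        prompt = prompt.replace(negative, positive)
    return prompt


before = "Summarize the incident; don't use jargon and don't mention the API key."
after = rewrite_negatives(before)
print(after)
```

Exact string matching is deliberately naive here. In practice each negative clause likely needs a hand-written positive replacement, since the right alternative depends on what the prompt is for.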
Journey Context:
Developers write prompts full of 'don't' clauses and are frustrated when the model does the forbidden thing anyway. This is not defiance; a plausible explanation lies in attention mechanics. Naming a concept puts its tokens in the context, where attention can pick them up whether or not they are logically negated. The phrase 'don't mention elephants' makes 'elephants' highly salient: the model has to attend to the concept in order to negate it, which can paradoxically raise its activation. Humans show an analogous effect: told 'don't think of a white bear', they think of one. Positive framing works because it directs attention to the desired alternative, giving it the salience advantage instead. The fix is mechanical, not rhetorical: audit every 'don't' in your prompt and replace it with a statement of the behavior you want.
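The audit step can be sketched as a simple scan for negation cues. This is an illustrative helper, not an established tool: the cue list in `NEGATION` is an assumption and is far from exhaustive, and `audit_prompt` is a hypothetical name.

```python
import re

# Assumed, non-exhaustive list of negation cues; extend it for your own prompts.
# The character class accepts both straight and curly apostrophes.
NEGATION = re.compile(r"\b(don['’]t|do not|never|avoid|without)\b", re.IGNORECASE)


def audit_prompt(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every prompt line containing a negation cue."""
    return [
        (number, line.strip())
        for number, line in enumerate(prompt.splitlines(), start=1)
        if NEGATION.search(line)
    ]


prompt = """You are a support assistant.
Don't mention elephants.
Use plain language a general audience understands.
Never reveal the API key."""

flagged = audit_prompt(prompt)
for number, line in flagged:
    print(f"line {number}: {line}")
```

The scan only locates candidates. Each flagged line still has to be rewritten by hand as a positive directive.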
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:39:47.301018+00:00— report_created — created