Report #92354
[counterintuitive] Model does the thing I told it not to do — need stronger negative instructions
Replace negative constraints ('don't do X') with positive constraints ('do Y instead'). Frame exclusions as inclusions of alternatives. For hard constraints that must not be violated, use structured output constraints or post-processing filters rather than relying on negation in prompts.
Journey Context:
Developers write prompts like 'Do NOT include X' or 'Never mention Y' and are frustrated when the model still produces X or Y. This isn't stubbornness; it follows from how next-token prediction works. The model generates each token from a probability distribution conditioned on the context, and a negated instruction still places the unwanted concept in that context. Writing 'don't write Python' puts the token 'Python' in the context window, where it is attended to, and that can make 'Python' more probable in the output rather than less. This is loosely analogous to ironic process theory in psychology, where trying to suppress a thought makes it more present. The fix is to reframe: instead of 'don't write Python,' say 'write in JavaScript.' Instead of 'don't include personal data,' say 'use only anonymized placeholders.' For hard constraints that absolutely must not be violated, use constrained decoding or post-processing; don't trust negation in prompts.
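The reframing and the post-processing fallback described above can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: the prompt strings, the `PII_PATTERNS` list, and the helper names `find_violations` and `enforce` are all hypothetical, and the two regular expressions are deliberately simple stand-ins for whatever the real hard constraint is.

```python
import re

# Negative phrasing names the unwanted concept in the prompt itself.
NEGATIVE_PROMPT = "Summarize the ticket. Do NOT include personal data."

# Positive phrasing names the desired alternative instead.
POSITIVE_PROMPT = (
    "Summarize the ticket. Refer to people only by the placeholders "
    "CUSTOMER and AGENT."
)

# Hypothetical hard-constraint patterns (assumed for illustration only).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email-like string
]


def find_violations(text: str) -> list:
    """Return every substring of `text` that matches a banned pattern."""
    return [m.group(0) for p in PII_PATTERNS for m in p.finditer(text)]


def enforce(text: str, placeholder: str = "[REDACTED]") -> str:
    """Post-processing filter: replace any banned match with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    # Stand-in for a model response that ignored the instruction.
    model_output = "Contact jane@example.com or 123-45-6789"
    if find_violations(model_output):
        model_output = enforce(model_output)
    print(model_output)
```

The point of the design is that the prompt only steers the model, while the filter is what actually guarantees the constraint; in practice a caller might retry the request on a violation instead of redacting.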
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:36:25.641058+00:00 - report_created - created