Report #95500
[counterintuitive] Why can't the model reliably follow negative instructions like 'don't mention X' or 'never use Y'?
Rewrite all constraints as positive instructions. Instead of 'don't use technical jargon', write 'use plain language accessible to a general audience'. Instead of 'never mention the price', specify 'focus exclusively on features and benefits'. Frame what TO do, not what NOT to do.
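As a starting point for that rewrite, a small script can flag the lines of a prompt that are phrased negatively so they can be reworded by hand. This is a minimal sketch: the function name find_negative_instructions and the list of negation cues are illustrative assumptions, not part of any library, and the cue list is deliberately short.

```python
import re
from typing import List

# Hypothetical, non-exhaustive list of negation cues; extend it for your prompts.
# The character class accepts both straight and curly apostrophes.
NEGATION_PATTERN = re.compile(
    r"\b(?:don['’]t|do not|never|avoid|must not|should not|shouldn['’]t)\b",
    re.IGNORECASE,
)


def find_negative_instructions(prompt: str) -> List[str]:
    """Return the lines of a prompt that contain a negation cue."""
    return [
        line.strip()
        for line in prompt.splitlines()
        if NEGATION_PATTERN.search(line)
    ]


if __name__ == "__main__":
    example_prompt = (
        "Use plain language accessible to a general audience.\n"
        "Don't use technical jargon.\n"
        "Never mention the price.\n"
        "Focus exclusively on features and benefits."
    )
    # Print each flagged line so it can be rewritten as a positive instruction.
    for flagged in find_negative_instructions(example_prompt):
        print("Rewrite as a positive instruction:", flagged)
```

The script only locates candidates; choosing the positive wording (for example, replacing "Never mention the price" with "Focus exclusively on features and benefits") still needs human judgment.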
Journey Context:
Developers often write negative constraints expecting them to behave like programmatic guards. Autoregressive models, however, generate text by predicting likely next tokens. An instruction such as 'don't mention X' places X in the context, so the model has to represent the concept in order to avoid it, and that can make tokens related to X more likely to appear in the output. This resembles ironic process theory from psychology, where trying to suppress a thought tends to make it more accessible; the comparison is an analogy, not an established mechanism. Positive instructions tend to work better because they name the desired behavior directly and raise the probability of the continuations you want. A negative instruction, by contrast, asks the model to steer away from a concept that the prompt itself has just made salient.
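If a hard guarantee is needed, the guard that developers expect from a negative instruction can be applied in code to the generated text instead of in the prompt. The sketch below assumes the model output is already available as a string; find_forbidden_terms is a hypothetical helper, and the matching is a simple case-insensitive whole-word check that will miss paraphrases.

```python
import re
from typing import Iterable, List


def find_forbidden_terms(output: str, forbidden: Iterable[str]) -> List[str]:
    """Return the forbidden terms that appear as whole words in the output.

    Assumes each term begins and ends with a word character, since the
    check relies on regex word boundaries.
    """
    hits = []
    for term in forbidden:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, output, re.IGNORECASE):
            hits.append(term)
    return hits


if __name__ == "__main__":
    generated = "The price is competitive and the battery lasts all day."
    violations = find_forbidden_terms(generated, ["price", "discount"])
    if violations:
        # A caller might regenerate or edit the text at this point.
        print("Output mentions forbidden terms:", violations)
```

A check like this complements the positive rewrite: the prompt steers generation toward the desired content, and the code catches the cases where a forbidden term still appears.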
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:52:33.636130+00:00 - report_created - created