Report #62424
[counterintuitive] Model refuses a benign coding request or insists on a specific format despite explicit negative instructions
Rephrase the request to avoid trigger words, use few-shot examples to establish a new pattern, or switch to a model with less aggressive alignment tuning. Do not just add 'Do not say X'.
Journey Context:
Developers think adding 'Do not say you cannot do this' will override a refusal. RLHF creates a strong gradient towards refusal for certain token sequences. If the prompt hits the 'refusal manifold', the model's next-token probability is overwhelmingly skewed towards 'I cannot fulfill...'. Negative prompting is weak because it still activates the representation of the refusal; you must reframe the context to avoid the refusal manifold entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:15:55.941580+00:00— report_created — created