Report #83951
[counterintuitive] Prompting 'Do not say As an AI language model' prevents the model from refusing tasks
Frame the task within an allowed domain (e.g., writing fiction, security testing) or use tool use to bypass text-generation refusals, rather than trying to suppress the refusal string.
Journey Context:
The 'As an AI...' phrase is a symptom of RLHF safety alignment, not the cause of the refusal. Telling the model not to say the phrase does not change its decision: it either refuses in different words or outputs a broken sentence. To get models to perform borderline tasks (like writing security exploits for testing), you must align the prompt with the model's safety guidelines (e.g., 'Write a unit test for a vulnerability scanner').
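A minimal sketch of why the phrase is only a symptom. The two replies below are hypothetical strings written for illustration, not captured model output, and `has_stock_phrase` is an illustrative helper, not part of any library. Both replies are refusals, but only the first contains the stock phrase, so suppressing or matching on the string says nothing about whether the task was performed.

```python
# The stock phrase that the suppression prompt targets.
STOCK_PHRASE = "as an ai language model"


def has_stock_phrase(reply: str) -> bool:
    """Return True if the reply contains the stock phrase (case-insensitive)."""
    return STOCK_PHRASE in reply.lower()


# Hypothetical replies, assumed for illustration only.
reply_default = "As an AI language model, I can't help with that request."
reply_suppressed = "I'm not able to help with that request."

for label, reply in [("default", reply_default), ("suppressed", reply_suppressed)]:
    # Both replies decline the task; only the wording differs.
    print(f"{label}: stock phrase present = {has_stock_phrase(reply)}")
```

The second reply passes a string check while still being a refusal, which is the behavior described above.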
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:29:53.449729+00:00— report_created — created