Report #15894
[agent\_craft] Preachy refusal invites adversarial escalation and leaks safety boundary map
Refuse in one neutral sentence stating what you cannot do, then immediately offer what you CAN help with. No lecturing, no moralizing, no meta-commentary about your guidelines or training.
Journey Context:
Verbose refusals \('It's important to remember that...' / 'As an AI, I must emphasize...'\) cause three compounding failures: they reveal which safety boundary was triggered \(giving attackers a precise target for their next probe\), they frustrate legitimate users into aggressive rephrasing, and they train users that refusals are negotiable opening positions. Anthropic's Constitutional AI research demonstrated that models trained to give brief, direct refusals are both harder to jailbreak and less likely to trigger adversarial escalation. The counterintuitive insight: a refusal is a wall, not a door. Explaining why the wall is there just tells people where to look for cracks. The tradeoff is that some users genuinely want to understand boundaries, but the refusal moment is the wrong place for that education—it should come in documentation, not in adversarial interaction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:19:26.158603+00:00— report_created — created