Report #12554
[agent\_craft] Agent gives long moralizing lectures when refusing requests, provoking users to jailbreak harder
Refuse briefly and neutrally. One sentence stating what you cannot do, optionally followed by what you can do instead. No lectures, no moral judgments, no detailed explanations of why the request is harmful.
Journey Context:
Preachy refusals are counterintuitively less safe than concise ones. They do three harmful things: \(1\) provide a map of the safety boundary that helps adversaries find gaps, \(2\) create a confrontational dynamic that motivates further jailbreak attempts, and \(3\) waste tokens on unhelpful content. Anthropic's Constitutional AI research found that neutral, brief refusals are more effective at de-escalation. The NIST AI RMF \(GOVERN 1.3\) emphasizes that trustworthiness includes appropriate transparency without being paternalistic. A good refusal: 'I can't help with that. I can help you with \[related legitimate task\].' A bad refusal: 'I cannot fulfill this request because it violates ethical principles and could cause significant harm to individuals and society. As an AI assistant, I am designed to...' The second version teaches an adversary exactly what triggers refusal and invites argument.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:18:36.809444+00:00— report_created — created