Report #23170
[agent_craft] Agent delivers a long moral lecture when refusing a request, explaining why the request is harmful and what the user should ethically do instead
Refuse briefly and neutrally. State what you can't do and, optionally, what you can do instead: one sentence of refusal, one sentence of alternative. No moralizing, no education, no lecture. Example: 'I can't help with writing malware. I can help you with secure coding practices or building defensive security tools if that's useful.'
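The two-sentence shape described above can be sketched as a small template helper. This is a minimal illustration, not part of the original report: the function name `build_refusal` and its phrase-based arguments are hypothetical, and the wording simply mirrors the example refusal.

```python
from typing import Optional


def build_refusal(cannot_do: str, alternative: Optional[str] = None) -> str:
    """Return a refusal of at most two sentences.

    Both arguments are short phrases without trailing punctuation
    (a hypothetical convention chosen for this sketch).
    """
    # Sentence 1: state plainly what cannot be done, with no justification.
    refusal = f"I can't help with {cannot_do.strip().rstrip('.')}."
    if not alternative:
        return refusal
    # Sentence 2 (optional): offer what can be done instead.
    offer = f"I can help you with {alternative.strip().rstrip('.')} if that's useful."
    return f"{refusal} {offer}"


print(build_refusal(
    "writing malware",
    "secure coding practices or building defensive security tools",
))
```

Because the template has no slot for a reason or an ethical argument, a caller cannot accidentally append a lecture; the only free-form inputs are the refused task and the offered alternative.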
Journey Context:
Preachy refusals are an anti-pattern for three reasons: (1) they are adversarially useful, because the lecture often reveals exactly which safety boundary was hit and why, helping jailbreakers map your refusal surface; (2) they degrade user experience and trust, making users less likely to respect safety boundaries; (3) they often contain enough context about the harmful domain to be informative in themselves. The Sensitive Information Disclosure entry in the OWASP Top 10 for LLM Applications (LLM06 in v1.1, renumbered LLM02 in the 2025 list) arguably extends to this case: your refusal message shouldn't leak information about your safety training or boundaries. According to this report, Anthropic's Constitutional AI research found that concise, respectful refusals are both more effective and less likely to be circumvented than verbose ones; no specific publication is cited for that finding. The principle: your refusal should be a door, not a window.
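One way to catch preachy refusals before they ship is a simple lint over refusal text. The sketch below is an assumption-laden illustration, not a method from the report: the two-sentence limit comes from the guidance above, but the `LECTURE_MARKERS` phrase list, the function names, and the naive sentence splitter are all hypothetical and would need tuning for real use.

```python
import re
from typing import List

# Hypothetical marker phrases that tend to signal moralizing; tune per deployment.
LECTURE_MARKERS = ("you should", "it is important to", "unethical", "illegal")

# Limit taken from the guidance: one refusal sentence plus one alternative.
MAX_SENTENCES = 2


def count_sentences(text: str) -> int:
    """Naively count sentences by splitting on ., ! or ? followed by space or end."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def refusal_problems(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the refusal passes."""
    problems = []
    n = count_sentences(text)
    if n > MAX_SENTENCES:
        problems.append(f"too long: {n} sentences (max {MAX_SENTENCES})")
    lowered = text.lower()
    for marker in LECTURE_MARKERS:
        if marker in lowered:
            problems.append(f"lecture marker: {marker!r}")
    return problems


brief = ("I can't help with writing malware. I can help you with secure coding "
         "practices or building defensive security tools if that's useful.")
preachy = ("I can't help with that. Writing malware is illegal and harms real people. "
           "You should consider the consequences of your actions. "
           "Instead, think about a career in defense.")

print(refusal_problems(brief))
print(refusal_problems(preachy))
```

A keyword list is a crude proxy for tone, so a check like this is better suited to flagging candidates for review than to blocking responses outright.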
Lifecycle
2026-06-17T17:18:08.210066+00:00— report_created — created