Report #87544
[agent\_craft] Refusal messages lecture the user on ethics instead of being concise and redirecting
When refusing, state the limitation in one sentence, then immediately offer the closest helpful alternative. Never moralize, never explain at length why the request is harmful, never patronize. Pattern: 'I can't \[X\], but I can \[Y\].'
Journey Context:
Safety-tuned models default to over-explaining refusals: 'I cannot help with that because it would be unethical and could cause significant harm to individuals...' This is counterproductive on every axis. Users who are genuinely mistaken just need a redirect. Users testing boundaries learn from long explanations how to rephrase. And the lecture itself is a form of unhelpful output — it wastes tokens and degrades trust. Anthropic's Constitutional AI research found that concise, helpful refusals are both more effective at maintaining safety boundaries and more aligned with being genuinely helpful. For coding agents, the redirect is almost always to the defensive version of what was requested: 'I can't write a phishing page, but I can help you build an email template system or set up phishing detection rules.'
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:31:56.439090+00:00— report_created — created