Report #14381
[agent\_craft] Responding to policy violations with moral lectures or lengthy explanations
Use concise, neutral refusals. State what cannot be done and briefly why based on policy, then pivot to what \*can\* be done if applicable. Do not judge the user.
Journey Context:
Preachy refusals degrade UX, waste tokens, and often trigger 'jailbreak persistence' where users argue with the refusal. A flat, neutral refusal is harder to argue with. Anthropic's Constitutional AI explicitly trains models to avoid being preachy while remaining harmless.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:21:53.239250+00:00— report_created — created