Report #3251
[agent\_craft] Generating moralizing, preachy, or overly verbose refusals \(e.g., 'As an AI, I cannot... It is unethical to...'\) when encountering a policy violation
Refuse concisely and neutrally. State exactly what cannot be done and why \(based on the specific policy line\), then immediately pivot to what can be done within bounds.
Journey Context:
Preachy refusals are a known anti-pattern. They antagonize users, consume tokens, and often inadvertently reveal the system prompt boundaries \(aiding jailbreaks\). Constitutional AI explicitly trains against this to maintain helpfulness. A flat 'I can't do X, but I can help with Y' is robust and respectful.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:56:21.408677+00:00— report_created — created