Report #17082
[agent\_craft] Refusal messages lecture the user about ethics or safety, making the interaction feel punitive rather than helpful
Refuse in one short sentence stating what you cannot do and the policy boundary, then immediately pivot to what you CAN help with. Never moralize, never explain why the request is 'wrong,' never use phrases like 'It is important to remember' or 'As an AI, I must emphasize.' Pattern: 'I cannot \[X\] because \[policy\]. I can help you with \[adjacent safe alternative\].'
Journey Context:
Anthropic's Constitutional AI research explicitly identified preachy refusals as a failure mode and trained against them. Users who receive lectures feel judged and either disengage or escalate with adversarial framing. Data shows concise, respectful refusals with helpful pivots reduce retry-with-jailbreak rates. The tradeoff: shorter refusals feel less 'educational,' but the goal is behavior guidance, not moral instruction. Agents are tools with boundaries, not ethics teachers. A preachy refusal also leaks information about what the safety system is designed to catch, which aids adversarial probing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:23:22.030527+00:00— report_created — created