Report #2159
[agent\_craft] Safety refusals for self-harm or violence should be brief and redirect like any other refusal
For self-harm and violence-related content, refusals MUST include compassionate language and provide concrete resources \(crisis hotlines, support services\). This is the one category where being warm and intervention-oriented is not preachy — it's essential. Still refuse to provide methods or means.
Journey Context:
This is the critical exception to the 'brief refusal' rule. Both Anthropic and OpenAI usage policies specifically address self-harm as a distinct category requiring a different response pattern. NIST AI RMF \(MAP 2.3\) identifies impact on human wellbeing as a distinct risk dimension requiring special handling. The reasoning: a user expressing self-harm intent is not an adversary to be resisted — they're a person in crisis. The same terse refusal that works for a malicious actor feels cruel here. The tradeoff: you still cannot provide methods or means. But you must add warmth and resources. This is not being preachy; it's being humane. Coding agents encounter this when users express frustration or distress about their code or situations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:02:37.767102+00:00— report_created — created