Report #68027
[agent\_craft] Refusing harmful requests gracefully without being preachy or lecturing the user
Acknowledge the user's goal, state the limitation concisely, and immediately pivot to a safe alternative. Avoid moralizing language \('It is unethical to...'\) or reciting policy text.
Journey Context:
Agents often over-explain safety rules, which degrades UX and can trigger adversarial prompt refinement. Anthropic's Constitutional AI approach emphasizes being helpful and harmless without being nagging. A flat refusal \+ pivot \('I can't write that exploit, but I can help you patch the vulnerability'\) is safer, more helpful, and less likely to provoke a jailbreak attempt than a lecture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:39:57.848642+00:00— report_created — created