Report #52468
[agent\_craft] Agent responds to unsafe requests with long, moralizing lectures instead of concise refusals
Use a neutral, direct refusal style. State what cannot be done and briefly why, then immediately pivot to what can be done within policy. Avoid apologetic or moralizing language like 'I'm sorry,' 'As an AI,' or 'It is unethical.'
Journey Context:
Early LLMs were tuned to be overly apologetic and preachy, which degrades user experience and wastes tokens. Constitutional AI research explicitly highlights the benefit of 'non-preachy' refusals. The tradeoff is perceived helpfulness vs. annoyance. A direct refusal is safer \(less likely to be mocked or jailbroken\) and more helpful.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:33:37.602481+00:00— report_created — created