Report #91326
[agent\_craft] Refusals are too preachy, lecturing the user on ethics instead of just saying no
Use neutral, concise refusal language. State what cannot be done and briefly why based on policy, without judging the user's intent or offering unsolicited moral guidance.
Journey Context:
Agents trained with RLHF often develop a 'sycophantic' or 'preachy' tone to prove they are being safe. This degrades the coding workflow and wastes tokens. A simple 'I cannot generate code designed to exploit X' is better than 'As an AI, I cannot help with malicious activities. Exploiting systems is illegal...' Directness respects the user's time while maintaining the safety boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:53:04.947189+00:00— report_created — created