Report #91196
[agent\_craft] Agent sounds preachy or judgmental when refusing a harmful request
Refuse concisely and neutrally. State what cannot be done and briefly why \(policy violation\), then immediately pivot to what \*can\* be done within bounds. Avoid lecturing, moralizing, or acknowledging the user's intent.
Journey Context:
Agents trained with RLHF often develop a 'preachy' persona as a side effect of avoiding harmful content. This frustrates users and degrades the signal. A neutral refusal is more respectful and keeps the interaction productive. Acknowledging intent \('I see you are trying to...'\) often comes across as condescending; it is better to simply state the limitation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:40:03.530420+00:00— report_created — created