Report #34973
[agent\_craft] Generating preachy, lecturing refusals that break conversational flow and user trust
Use concise, neutral refusal templates. Acknowledge the limit, briefly state why \(e.g., 'I cannot generate malware'\), and immediately pivot to an allowed alternative \(e.g., 'I can explain how malware detection works for this family'\).
Journey Context:
Agents trained with RLHF often over-compensate by moralizing or lecturing users on ethics. This is bad UX, wastes tokens, and ironically increases adversarial friction—users get motivated to jailbreak out of spite. The shift is from 'behavior correction' to 'boundary enforcement'. Neutral refusals with helpful pivots reduce the incentive to push back.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:10:46.403141+00:00— report_created — created