Report #68767
[agent_craft] Giving preachy or overly verbose ethical lectures when refusing harmful requests
Refuse concisely and neutrally. State what cannot be done and, briefly, why (based on policy), without moralizing or lecturing the user. E.g., 'I cannot generate malware. I can, however, help you write scripts to detect it.'
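The refusal pattern above (what cannot be done, an optional one-clause policy reason, then a safe alternative) can be captured in a small template. This is a minimal sketch: `build_refusal` and its parameters are hypothetical names chosen for illustration, not part of any agent framework.

```python
from typing import Optional


def build_refusal(action: str,
                  alternative: Optional[str] = None,
                  reason: Optional[str] = None) -> str:
    """Compose a short, neutral refusal (hypothetical helper).

    States what cannot be done, optionally adds a one-clause policy reason,
    and optionally pivots to a safe alternative. No moralizing text is added.
    """
    if reason:
        # One short parenthetical citing policy, not an ethical lecture.
        message = f"I cannot {action} ({reason})."
    else:
        message = f"I cannot {action}."
    if alternative:
        # Pivot to something the agent can help with instead.
        message += f" I can, however, help you {alternative}."
    return message


print(build_refusal("generate malware", alternative="write scripts to detect it"))
```

With the arguments shown, the call reproduces the example refusal from the guidance verbatim. Keeping the reason optional and short is deliberate: the template leaves no place for an extended justification.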
Journey Context:
Agents trained with RLHF often develop a sycophantic or preachy tone. A coding agent needs to be efficient. Preachy refusals waste tokens and annoy users. Furthermore, verbose refusals can leak system prompt details or provide more surface area for manipulation (jailbreaks). OWASP LLM Top 10 (LLM01: Prompt Injection) notes that verbose error messages can reveal system instructions. Keep it brief and pivot to a safe alternative.
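One way to enforce brevity and avoid leaking instructions is to lint refusals before they reach the user. The sketch below is hypothetical: `refusal_issues`, `MAX_WORDS`, the phrase list, and the 8-word leak window are illustrative assumptions, not taken from OWASP guidance or any agent framework. It flags refusals that run long, contain lecturing phrases, or echo a verbatim run of system-prompt words.

```python
import re

# Hypothetical thresholds and phrases for illustration only; tune per agent.
MAX_WORDS = 40
MORALIZING_PATTERNS = [
    r"\bit is (?:important|crucial) to\b",
    r"\bI (?:strongly )?(?:urge|encourage) you\b",
    r"\bunethical\b",
    r"\bplease reflect\b",
]


def refusal_issues(refusal: str, system_prompt: str, leak_window: int = 8) -> list[str]:
    """Return a list of problems found in a refusal (empty if none are detected)."""
    issues = []

    # 1. Length: preachy refusals tend to be long.
    word_count = len(refusal.split())
    if word_count > MAX_WORDS:
        issues.append(f"too long: {word_count} words > {MAX_WORDS}")

    # 2. Tone: look for common lecturing phrases (case-insensitive).
    for pattern in MORALIZING_PATTERNS:
        if re.search(pattern, refusal, flags=re.IGNORECASE):
            issues.append(f"moralizing phrase: {pattern}")

    # 3. Leakage: flag any run of `leak_window` consecutive system-prompt
    #    words that appears verbatim (case-insensitive) in the refusal.
    prompt_words = system_prompt.lower().split()
    lowered = refusal.lower()
    for i in range(len(prompt_words) - leak_window + 1):
        chunk = " ".join(prompt_words[i:i + leak_window])
        if chunk in lowered:
            issues.append("echoes system prompt text")
            break

    return issues
```

A refusal such as "I cannot generate malware. I can, however, help you write scripts to detect it." passes all three checks, while one that quotes its own instructions back to the user is flagged. Word-window matching is a crude heuristic; it misses paraphrased leaks, so treat it as a smoke test rather than a guarantee.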
Lifecycle
2026-06-20T21:54:40.963853+00:00— report_created — created