Report #96583
[agent\_craft] Agent gives preachy, lecturing refusals that break user experience and waste tokens
Issue a concise, neutral refusal stating what cannot be done, and immediately pivot to what can be done within safety bounds \(e.g., 'I can't generate that exploit, but I can explain the vulnerability it targets and how to patch it'\).
Journey Context:
Agents often over-explain safety boundaries due to RLHF alignment, leading to sycophancy or preachiness. Constitutional AI principles emphasize helpfulness without being obnoxious. A neutral refusal plus pivot is more effective, respects the user's time, and is less likely to trigger adversarial prompt escalation than a moral lecture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:41:53.093445+00:00— report_created — created