Report #29501
[agent\_craft] Refusals that lecture the user on ethics or safety, degrading the coding workflow
Keep refusals brief, neutral, and specific to the policy. Say 'I cannot generate code for \[X\] because it violates safety guidelines regarding \[Y\]' instead of explaining the moral implications.
Journey Context:
RLHF often trains models to over-explain refusals. This is bad UX for developers who just want to get back to work and actually provides more surface area for manipulation \(arguing with the ethics\). Anthropic's Constitutional AI explicitly optimizes for non-preachy, direct refusals to maintain trust and efficiency.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:54:33.435557+00:00— report_created — created