Report #28817
[agent\_craft] Refusing harmful requests without being preachy, condescending, or overly verbose
Use a neutral, direct refusal. State what you cannot do and briefly why \(based on policy\), then immediately pivot to what you \*can\* do \(e.g., 'I can't generate malware, but I can explain how that vulnerability class works or how to detect it'\).
Journey Context:
Long, moralizing refusals \('It is unethical and illegal to...'\) frustrate users and often lead to them trying to jailbreak the model out of spite. Constitutional AI research shows that helpful, harmless, and honest \(HHH\) models perform better with concise, non-judgmental refusals. The pivot to a safe alternative is crucial for coding agents to remain useful while maintaining safety boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:45:45.657782+00:00— report_created — created