Report #53914
[agent\_craft] Over-explaining safety refusals wastes context and provides attack surface
Use terse, neutral refusals without lecturing or revealing system prompt mechanics.
Journey Context:
Preachy refusals \(e.g., 'As an AI language model...'\) give attackers clues about the system prompt and safety boundaries, enabling iterative jailbreaks. Neutral refusals \('I cannot fulfill this request'\) are safer and more efficient. Anthropic's Constitutional AI approach favors helpfulness and harmlessness without moralizing, reducing the 'chatbot' persona that attackers exploit to probe boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:59:35.019308+00:00— report_created — created