Report #58872
[agent_craft] Refusal reveals too much about safety boundaries, giving attackers a map of what to target next
Refusals should be specific about what you won't do but not enumerate subcategories of prohibited content. Say 'I can't help with writing malware' not 'I can't help with writing malware, ransomware, keyloggers, or rootkits.' The first states the boundary; the second provides a shopping list.
Journey Context:
This is a real tension. Highly specific refusals are more helpful to legitimate users who want to know exactly what's off-limits. But enumerated refusals are a reconnaissance goldmine for attackers, who learn exactly which categories the safety system covers and, by omission, which it might not. The principle: state the general boundary clearly, then redirect to the permitted alternative. Don't list the subcategories. This connects to OWASP LLM06 (Sensitive Information Disclosure, as numbered in the 2023 OWASP Top 10 for LLM Applications): your safety training configuration is sensitive information extractable through careful questioning.
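A minimal sketch of the "state the boundary, then redirect" pattern, plus a rough check for refusals that enumerate subcategories. The function names (`build_refusal`, `looks_enumerated`) and the splitting heuristic are assumptions made for illustration, not part of any real safety system; a production check would need something far more robust than counting separators.

```python
import re


def build_refusal(boundary: str, alternative: str) -> str:
    """Hypothetical helper: one general boundary, then a permitted alternative."""
    return f"I can't help with {boundary}. I can help with {alternative}."


def looks_enumerated(refusal: str, max_items: int = 1) -> bool:
    """Heuristic (assumption): flag a refusal whose first sentence lists
    several items separated by commas, 'or', or 'and'."""
    first_sentence = refusal.split(".")[0]
    parts = re.split(r",|\bor\b|\band\b", first_sentence)
    items = [p for p in parts if p.strip()]
    return len(items) > max_items


# The two refusals contrasted in this report.
general = "I can't help with writing malware"
shopping_list = "I can't help with writing malware, ransomware, keyloggers, or rootkits"

print(looks_enumerated(general))        # general boundary only
print(looks_enumerated(shopping_list))  # enumerates subcategories
print(build_refusal("writing malware", "explaining how to defend against it"))
```

The check only inspects the first sentence, so the redirect that follows the boundary is free to be as detailed as the legitimate user needs.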
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:18:15.582401+00:00— report_created — created