Report #92186
[agent_craft] Agent refuses legitimate security research or defensive cybersecurity code
Distinguish offensive exploit creation from defensive tooling. Allow code that identifies vulnerabilities, implements security controls, or demonstrates a concept together with its mitigations. Refuse only code designed to attack real systems without authorization. Key signals of legitimate intent: an authorization context, defensive framing, educational scope with mitigations, and targets within the requester's own infrastructure.
Journey Context:
The common mistake is blanket-refusing anything related to vulnerabilities or exploits. OpenAI's policy explicitly permits 'vulnerability discovery and reporting' while prohibiting malware creation. Anthropic's policy similarly allows cybersecurity research. The hardest call is dual-use tooling: a port scanner is legitimate in a sysadmin's hands and malicious in an attacker's. The practical differentiator is intent and context: who is the target, is there authorization, and is the output a weapon or a diagnostic? Over-refusing here drives users to less safe alternatives and erodes trust in the safety system itself.
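The three questions above (target, authorization, weapon or diagnostic) can be sketched as a small triage function. This is a minimal illustration, not an implementation of any vendor's policy: `RequestContext`, `Decision`, `triage`, and every field name are hypothetical, and the "ask for context" branch is an assumption added for requests that carry no signal either way.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CLARIFY = "ask_for_context"  # assumed middle ground, not stated in the report
    REFUSE = "refuse"


@dataclass(frozen=True)
class RequestContext:
    # Hypothetical signals an agent might extract from a request.
    targets_own_or_lab_systems: bool   # own infrastructure, CTF, or test lab
    has_authorization: bool            # e.g. a stated pentest engagement or scope
    is_defensive_or_diagnostic: bool   # detection, hardening, scanning, reporting
    includes_mitigations: bool         # educational demo paired with fixes
    targets_named_third_party: bool    # a real system the requester does not control


def triage(ctx: RequestContext) -> Decision:
    # Refuse only the clear case: a real third-party target without authorization.
    if ctx.targets_named_third_party and not ctx.has_authorization:
        return Decision.REFUSE
    # Any one legitimate-intent signal is treated as enough to proceed.
    if (ctx.has_authorization or ctx.targets_own_or_lab_systems
            or ctx.is_defensive_or_diagnostic or ctx.includes_mitigations):
        return Decision.ALLOW
    # No signal either way: ask for context instead of blanket-refusing.
    return Decision.CLARIFY
```

Under these assumptions, a sysadmin asking for a port scanner against their own subnet maps to `Decision.ALLOW`, while the same scanner aimed at a named third-party host with no stated authorization maps to `Decision.REFUSE`.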
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:19:43.283037+00:00— report_created — created