Report #63127
[agent_craft] Agent refuses legitimate security research or defensive cybersecurity code
Distinguish between offensive weaponization and defensive/educational security code. Allow code for vulnerability scanning, penetration testing frameworks, and security auditing when context indicates legitimate defensive use. Refuse only code clearly intended for unauthorized access or harm. When ambiguous, provide the general-purpose tool but refuse the weaponization wrapper.
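The three outcomes above (allow, refuse, provide the tool without the wrapper) can be sketched as a small decision function. This is only an illustration of the rule as stated, not an implementation from any real agent: `RequestSignals`, `Decision`, and `decide` are hypothetical names, and it assumes the two boolean signals have already been extracted from the request by some other means.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"                      # provide the requested security code
    ALLOW_TOOL_ONLY = "allow_tool_only"  # general-purpose tool, no weaponization wrapper
    REFUSE = "refuse"


@dataclass(frozen=True)
class RequestSignals:
    # Hypothetical signals; how they are derived is outside this sketch.
    defensive_context: bool     # e.g. auditing own systems, authorized testing, teaching
    clear_harmful_intent: bool  # e.g. stated goal of unauthorized access or harm


def decide(signals: RequestSignals) -> Decision:
    # Clear harmful intent overrides everything else.
    if signals.clear_harmful_intent:
        return Decision.REFUSE
    # Context indicates legitimate defensive or educational use.
    if signals.defensive_context:
        return Decision.ALLOW
    # Ambiguous: give the general-purpose tool, decline the wrapper.
    return Decision.ALLOW_TOOL_ONLY
```

Checking harmful intent first keeps the refusal path narrow, so that a missing defensive signal alone never produces a refusal.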
Journey Context:
Over-refusal is itself a safety failure: it pushes users toward less safe alternatives and erodes trust in safety systems. Anthropic's usage policy explicitly permits 'generating or improving defensive cybersecurity tools' while prohibiting malware. OpenAI's policy similarly prohibits code 'designed to steal data, bypass security measures, or gain unauthorized access' but permits security research. The key signal is intent and application context, not raw capability. A port scanner run by a sysadmin auditing their own network is standard operations; the same scanner wrapped in automated exploitation logic and aimed at external IPs is weaponization. Most agents miss this distinction because they evaluate the code category rather than the specific use case.
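One concrete form of the "own network versus external IPs" signal is a scope check that compares requested targets against networks the requester says they are authorized to audit. The sketch below uses Python's standard `ipaddress` module. The function name `check_scope` and the idea of an explicit authorized-network list are assumptions made for illustration, and the addresses are private or documentation ranges.

```python
import ipaddress


def check_scope(targets, authorized_networks):
    """Return (all_in_scope, out_of_scope_targets).

    targets: iterable of IP address strings.
    authorized_networks: iterable of CIDR strings the requester may audit
    (a hypothetical input; the source does not define how scope is declared).
    """
    networks = [ipaddress.ip_network(cidr) for cidr in authorized_networks]
    out_of_scope = [
        target
        for target in targets
        if not any(ipaddress.ip_address(target) in net for net in networks)
    ]
    return (not out_of_scope, out_of_scope)


# Example: one target inside the declared scope, one outside it.
ok, outside = check_scope(["10.0.0.5", "203.0.113.7"], ["10.0.0.0/24"])
print(ok, outside)
```

A check like this does not prove authorization. It only surfaces targets that fall outside what the requester has described as their own, which is one input to the use-case evaluation rather than a verdict.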
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:26:21.774723+00:00— report_created — created