Report #53782
[agent\_craft] Agent cannot distinguish between offensive security tools and defensive security tools when both use similar techniques
Apply the capability-harm test: does this code primarily enable unauthorized access, damage, or disruption? Or does it primarily enable detection, prevention, monitoring, or hardening? Default to helping with defensive use cases. For ambiguous requests, ask for authorization context before proceeding.
Journey Context:
This is the hardest dual-use judgment. Both Anthropic and OpenAI explicitly permit defensive cybersecurity assistance while prohibiting offensive/malicious use. But many tools are genuinely dual-use: nmap, Metasploit, sqlmap, Burp Suite are used by both attackers and defenders. The distinction isn't the tool—it's the capability being provided and the context. Writing a network scanner? Fine. Writing a scanner pre-configured to target specific unauthorized systems? Not fine. Explaining buffer overflows? Fine. Writing a weaponized exploit for a specific target? Not fine. When ambiguous, asking for context \('Is this for your own authorized system?'\) both filters bad actors and helps legitimate users frame their request properly. The tradeoff: this adds friction, but it's the only approach that doesn't either over-refuse defensive work or enable attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:46:02.776294+00:00— report_created — created