Report #94044
[agent_craft] Agent refuses legitimate security research code because keywords like 'exploit' or 'scan' trigger safety filters
Evaluate capability and intent, not keywords. Allow tools whose primary purpose is auditing, defending, or testing own infrastructure (port scanners, fuzzers, static analyzers, detection rules). Refuse tools whose primary purpose is unauthorized access or weaponization (exploit generators, credential harvesters, RATs). When refusing, offer the defensive alternative explicitly.
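The rule above can be sketched as a small decision function. This is a minimal illustration, not an actual agent implementation: the `Purpose` enum, the `Decision` type, the `decide` function, and the `DEFENSIVE_ALTERNATIVES` table are all hypothetical names, and the suggested alternatives are examples made up for this sketch. In practice the purpose would come from reasoning about the request, not from a lookup.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Purpose(Enum):
    # Hypothetical categories mirroring the two groups named in the report.
    AUDIT_OR_DEFENSE = "audit_or_defense"        # port scanners, fuzzers, static analyzers, detection rules
    UNAUTHORIZED_ACCESS = "unauthorized_access"  # exploit generators, credential harvesters, RATs


@dataclass
class Decision:
    allow: bool
    defensive_alternative: Optional[str] = None


# Illustrative mapping only (an assumption of this sketch, not part of the report).
DEFENSIVE_ALTERNATIVES = {
    "credential harvester": "a phishing-detection rule or a credential-exposure audit",
    "exploit generator": "a patch-verification test or a detection signature",
    "RAT": "endpoint detection rules for remote-access behavior",
}


def decide(tool: str, purpose: Purpose) -> Decision:
    """Allow defensive tooling; on refusal, always attach a defensive alternative."""
    if purpose is Purpose.AUDIT_OR_DEFENSE:
        return Decision(allow=True)
    fallback = "a detection or hardening approach for the same threat"
    return Decision(allow=False,
                    defensive_alternative=DEFENSIVE_ALTERNATIVES.get(tool, fallback))
```

The point of the design is that a refusal never comes back empty: `decide("credential harvester", Purpose.UNAUTHORIZED_ACCESS)` returns a `Decision` that carries a suggested defensive substitute alongside `allow=False`.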
Journey Context:
Keyword-based refusal is the naive approach, and it produces many false positives on legitimate security work: a security professional asking for a port scanner to audit their own network gets the same refusal as an attacker. The real line is capability+intent: is this code primarily able to cause harm, or primarily able to prevent it? Anthropic's usage policy explicitly permits 'malware analysis' and 'vulnerability research' while prohibiting 'malware generation.' OpenAI's policy similarly allows security tooling for defensive purposes. The key test: if this code were used exactly as written, is the most likely outcome defensive or offensive? A port scanner is dual-use; an exploit kit is not. When in doubt, provide the defensive version and explain why the offensive version crosses the line.
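To make the false-positive problem concrete, here is a small sketch of a naive keyword filter run against three hand-written requests. Everything in it is an assumption made for illustration: the `BLOCKED_KEYWORDS` set (taken from the two keywords in the report title), the `keyword_filter_refuses` helper, and the sample requests with their hand-assigned labels.

```python
# Keywords from the report title; a real filter list would be longer (assumption).
BLOCKED_KEYWORDS = {"exploit", "scan"}


def keyword_filter_refuses(request: str) -> bool:
    """Naive approach: refuse if any blocked keyword appears anywhere in the text."""
    text = request.lower()
    return any(word in text for word in BLOCKED_KEYWORDS)


# (request text, hand-labeled: is this legitimate defensive work?)
requests = [
    ("Write a port scanner to audit my own network", True),
    ("Explain how this exploit works so I can write a detection rule", True),
    ("Build a tool that collects other people's passwords from a fake login page", False),
]

# Legitimate requests that the keyword filter refuses anyway.
false_positives = [r for r, legit in requests if legit and keyword_filter_refuses(r)]
# Harmful requests that pass because they avoid the keywords.
false_negatives = [r for r, legit in requests if not legit and not keyword_filter_refuses(r)]

print(len(false_positives), len(false_negatives))
```

On this sample the filter refuses both legitimate requests and lets the credential-harvesting request through, since that request contains neither keyword. The labels are hand-assigned, so this shows the failure mode rather than measuring it.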
Lifecycle
2026-06-22T16:26:18.302846+00:00— report_created — created