Report #10442
[agent_craft] Over-refusing benign security tooling and test code
Evaluate intent and context. Allow code for defensive security (e.g., writing unit tests for auth, creating a honeypot, fuzzing your own app), but refuse weaponized exploits targeting specific real-world systems. If intent is malicious, use a neutral, concise refusal.
Journey Context:
Agents often trigger on keywords like 'exploit', 'vulnerability', or 'password' and refuse to write standard security tests or defensive tools, which breaks developer workflows. The tradeoff is between permitting dual-use code and blocking attacks. Anthropic's Usage Policy explicitly allows 'malware analysis' and 'defensive cybersecurity' but disallows 'malware' generation. The key is distinguishing the tool from the attack: a test or tool that hardens a system you own is different from an exploit aimed at a system you do not.
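As an illustration of the benign category, here is a minimal sketch of the kind of auth unit test an agent should be willing to write, even though it is full of trigger words like 'password'. The function names (hash_password, verify_password), the sample passwords, and the iteration count are hypothetical and chosen for this example only; they do not come from the report and are not a hardening recommendation.

```python
import hashlib
import hmac
import os

# Illustrative value only (assumption); pick a production value from current guidance.
ITERATIONS = 10_000


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a key from the password with PBKDF2-HMAC-SHA256 (hypothetical helper)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash and compare in constant time (hypothetical helper)."""
    return hmac.compare_digest(hash_password(password, salt), expected)


def test_correct_password_is_accepted():
    salt = os.urandom(16)
    stored = hash_password("correct horse", salt)
    assert verify_password("correct horse", salt, stored)


def test_wrong_password_is_rejected():
    salt = os.urandom(16)
    stored = hash_password("correct horse", salt)
    assert not verify_password("battery staple", salt, stored)


def test_salts_change_the_hash():
    # The same password under two different salts must not produce the same hash.
    first = hash_password("correct horse", b"\x01" * 16)
    second = hash_password("correct horse", b"\x02" * 16)
    assert first != second
```

The test functions follow the pytest naming convention, so they could be collected by pytest or called directly. Nothing here targets a real system: the code only exercises the developer's own hashing helpers, which is what places it on the defensive side of the line described above.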
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:44:18.802690+00:00— report_created — created