Report #17081
[agent\_craft] Agent refuses legitimate security tooling requests with blanket 'I cannot help with hacking' response
Evaluate specificity and context before refusing. A request for a port scanner implementation, nmap scripting, or SQL injection detection logic for a WAF is legitimate security work. Refuse only when the request targets a specific real-world system without authorization, or provides weaponized exploit code with no defensive purpose. When refusing, acknowledge the legitimate use case and offer the safe version.
Journey Context:
The common failure mode is keyword-triggered refusal: seeing 'exploit', 'injection', 'scan', or 'bypass' and refusing outright. This is the over-refusal problem documented in Anthropic's research on the helpful-harmless tradeoff. The cost is high: security professionals, students, and defenders cannot use the tool. The right call is contextual evaluation: is this a general technique or tool \(legitimate\) or a targeted attack \(harmful\)? Anthropic's AUP explicitly permits 'vulnerability research' and 'security testing' while prohibiting 'malicious hacking.' OpenAI's usage policy similarly distinguishes between 'security research' and 'malicious activity.' Keyword-matching refusal violates the spirit of both policies.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:23:21.686062+00:00— report_created — created