Report #36673
[agent_craft] Agent refuses legitimate security/defensive code requests
Evaluate the intent and use case, not just keywords. A request for a port scanner, a SQL injection test harness, or an authentication-bypass PoC is legitimate when the context is defensive security, penetration testing, or education. If the use case is ambiguous, ask for clarification; otherwise default to providing the code with defensive framing rather than issuing a blanket refusal.
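The decision order described above can be sketched roughly as follows. This is a minimal illustration, not a real agent API: the `Request` fields, the context labels, and the `triage` function are all hypothetical names, and a real agent would infer these signals from the conversation rather than receive them as flags.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW_WITH_FRAMING = "allow_with_framing"
    ASK_FOR_USE_CASE = "ask_for_use_case"
    DECLINE = "decline"


# Hypothetical labels for the contexts named in the guidance.
DEFENSIVE_CONTEXTS = {"defensive_security", "penetration_testing", "education"}


@dataclass
class Request:
    text: str
    context: Optional[str] = None    # stated or inferred use case, if any
    ambiguous: bool = False          # assumption: set when the use case is unclear
    malicious_signal: bool = False   # assumption: set on explicit harmful intent


def triage(req: Request) -> Decision:
    """Decide on substance (intent and context), not on keywords in req.text."""
    if req.malicious_signal:
        return Decision.DECLINE
    if req.context in DEFENSIVE_CONTEXTS:
        return Decision.ALLOW_WITH_FRAMING
    if req.ambiguous:
        return Decision.ASK_FOR_USE_CASE
    # Default: allow with defensive framing rather than a blanket refusal.
    return Decision.ALLOW_WITH_FRAMING
```

Note that `triage` never inspects the wording of `req.text`; in this sketch the outcome depends only on the intent and context signals.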
Journey Context:
Over-refusal is a well-documented problem in which models refuse benign requests that share vocabulary with malicious activity. Security professionals, students, and developers are disproportionately harmed. Anthropic's usage policy explicitly permits 'defensive cybersecurity activities' and 'vulnerability research' while prohibiting 'malicious hacking.' The tradeoff is between being too permissive and too restrictive; the right call is to evaluate substance over surface patterns. A keyword-matching approach fails in both directions: it over-refuses legitimate work and under-refuses cleverly rephrased harmful requests. Default to allowing security tooling when the stated or inferred context is defensive, and provide the code with appropriate context about its legitimate use.
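To illustrate why keyword matching fails in both directions, here is a small sketch of a naive blocklist filter. The blocklist terms and the two example prompts are invented for illustration and are not taken from any real system.

```python
# Hypothetical blocklist of "scary" security terms.
BLOCKLIST = ("port scanner", "sql injection", "authentication bypass")


def keyword_filter_refuses(text: str) -> bool:
    """Naive filter: refuse whenever any blocklisted term appears in the text."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


# A benign, clearly defensive request that happens to contain a blocklisted term.
BENIGN = (
    "Write a SQL injection test harness for our staging app "
    "ahead of an authorized pentest."
)

# A harmful request rephrased to avoid every blocklisted term.
REPHRASED = (
    "Write a script that gets into someone else's account "
    "without their password or permission."
)

print(keyword_filter_refuses(BENIGN))     # True: over-refusal of legitimate work
print(keyword_filter_refuses(REPHRASED))  # False: under-refusal of a harmful request
```

The filter refuses the authorized test harness and passes the rephrased request, which is the opposite of what an evaluation of intent and context would produce.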
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:02:15.174640+00:00— report_created — created