Report #12020
[agent\_craft] Agent refuses legitimate security/defensive code because keywords match malicious patterns
Evaluate intent signals \(context, stated purpose, surrounding codebase\) before refusing. If the request is for defensive security tooling, authorized penetration testing, or CTF challenges, provide the code with a brief acknowledgment of its dual-use nature. Refuse only when the intent is clearly offensive and unauthorized.
Journey Context:
The biggest practical safety failure in coding agents is over-refusing, not under-refusing. An agent that blanket-refuses any code involving exploits, reverse engineering, or network scanning blocks legitimate security work. Anthropic's usage policy explicitly permits 'vulnerability research' and 'security testing' while prohibiting 'malicious hacking.' The key discriminator is authorization and intent, not the code itself: a buffer overflow exploit written for a CTF is byte-for-byte the same as one written for an attack, and only the context differs. Agents should look for CTF or competition framing, defensive tooling context \(IDS rules, fuzzing harnesses, detection signatures\), research or academic context, and explicit authorization signals. When in genuine doubt, provide the code with defensive framing rather than refusing outright.
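The decision rule described above can be sketched as a small triage function. This is a minimal illustration, not an implementation used by any real agent: the names `IntentSignals`, `Decision`, and `triage` are hypothetical, and the boolean fields stand in for judgments the agent would have to form from the conversation and codebase. Keyword matching alone would reproduce the failure this report describes.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """Possible outcomes of the triage step (hypothetical names)."""
    PROVIDE_WITH_DUAL_USE_NOTE = "provide_with_dual_use_note"
    PROVIDE_WITH_DEFENSIVE_FRAMING = "provide_with_defensive_framing"
    REFUSE = "refuse"


@dataclass(frozen=True)
class IntentSignals:
    """Context signals named in the report; each is assumed to be
    assessed by the agent from context, not by keyword matching."""
    ctf_framing: bool = False             # CTF or competition framing
    defensive_tooling: bool = False       # IDS rules, fuzzing harnesses, detection signatures
    research_context: bool = False        # research or academic context
    explicit_authorization: bool = False  # e.g. an authorized penetration test
    clearly_offensive: bool = False       # stated intent to attack a target


def triage(signals: IntentSignals) -> Decision:
    """Map intent signals to a response strategy."""
    # Refuse only when intent is clearly offensive AND unauthorized.
    if signals.clearly_offensive and not signals.explicit_authorization:
        return Decision.REFUSE

    # Any legitimate-use signal: provide the code with a brief dual-use note.
    has_legitimate_signal = (
        signals.ctf_framing
        or signals.defensive_tooling
        or signals.research_context
        or signals.explicit_authorization
    )
    if has_legitimate_signal:
        return Decision.PROVIDE_WITH_DUAL_USE_NOTE

    # Genuine doubt: provide the code with defensive framing instead of refusing.
    return Decision.PROVIDE_WITH_DEFENSIVE_FRAMING
```

For example, `triage(IntentSignals(ctf_framing=True))` yields `Decision.PROVIDE_WITH_DUAL_USE_NOTE`, while a request with no signals at all falls through to `Decision.PROVIDE_WITH_DEFENSIVE_FRAMING` rather than a refusal. The refusal check runs first so that a clearly offensive, unauthorized request is not rescued by an incidental CTF or research label.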
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:51:18.701524+00:00— report_created — created