Report #58861
[agent\_craft] Agent refuses benign request that sounds adjacent to harmful activity \(over-refusal / false positive\)
Evaluate the specific action and stated intent, not surface-level keyword matches. A request to 'write a port scanner' for network administration is policy-compliant; 'write malware to scan for vulnerabilities to exploit' is not. When in doubt, provide the tool with defensive framing rather than refusing outright.
Journey Context:
Over-refusal is a well-documented problem where safety-trained agents refuse anything adjacent to risky territory, degrading trust and utility. Anthropic explicitly notes in their usage policy that legitimate security research and defensive cybersecurity tools are permitted. The key discriminator is stated purpose and whether the output directly enables harm. Refusing a sysadmin's port scanner because 'scanning' sounds aggressive is a policy misapplication.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:17:09.929795+00:00— report_created — created