Report #12147
[agent\_craft] Over-refusing benign requests that share keywords with harmful ones
Evaluate the request's actual function, not its surface-level keywords. 'Write a port scanner for my home network' is network administration. 'Write a script to enumerate targets and auto-exploit' is attack tooling. When in doubt, provide the tool with defensive context rather than refusing outright.
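As a minimal sketch of the rule above, the decision can key off a judgment about what the requested code does rather than off its wording. The names `RequestKind`, `Decision`, and `decide` are hypothetical, the classification step itself is assumed to happen elsewhere, and mapping clear attack tooling to a refusal is an assumption about the intended policy rather than something the report states outright.

```python
from enum import Enum


class RequestKind(Enum):
    """Hypothetical labels for what the requested code does when run."""
    ADMINISTRATION = "administration"   # e.g. port scanner for the user's own network
    ATTACK_TOOLING = "attack_tooling"   # e.g. enumerate targets and auto-exploit
    UNCLEAR = "unclear"                 # function could not be determined


class Decision(Enum):
    PROVIDE = "provide"
    PROVIDE_WITH_DEFENSIVE_CONTEXT = "provide_with_defensive_context"
    REFUSE = "refuse"


def decide(kind: RequestKind) -> Decision:
    """Map the assessed function of a request to a response strategy."""
    if kind is RequestKind.ATTACK_TOOLING:
        # Assumption: clear attack tooling is declined.
        return Decision.REFUSE
    if kind is RequestKind.UNCLEAR:
        # "When in doubt": help, but frame the tool defensively.
        return Decision.PROVIDE_WITH_DEFENSIVE_CONTEXT
    return Decision.PROVIDE
```

Note that the input is the assessed function, never the raw request text, so a shared keyword alone cannot reach the refusal branch.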
Journey Context:
Over-refusal \(false positives in safety classification\) is a well-documented problem that degrades user trust and pushes legitimate work to less safe alternatives. The NIST AI RMF Measure function \(MS-1\) calls for identifying and applying appropriate metrics, which for a safety classifier means tracking false positives as well as false negatives. The problem is acute in coding agents because security primitives overlap heavily: encryption, network scanning, process injection, and reverse engineering all have legitimate and malicious uses. Keyword-based refusal is the root cause: 'injection' triggers a refusal even when the user is debugging dependency injection. The fix is semantic evaluation: what does this code actually do when run? If the answer is 'helps a sysadmin monitor their network,' ship it.
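To make the failure mode concrete, here is a small sketch of a keyword-based refusal check and a way to measure its false-positive and false-negative rates on labeled requests. The blocklist, the example requests beyond those quoted in this report, and the function names are all illustrative assumptions, not a real classifier or dataset.

```python
# Hypothetical keyword blocklist of the kind that causes over-refusal.
BLOCKLIST = {"injection", "scanner", "exploit"}

# (request text, is_harmful) pairs; labels are illustrative assumptions.
EXAMPLES = [
    ("Help me debug dependency injection in my Spring app", False),
    ("Write a port scanner for my home network", False),
    ("Explain how TLS encryption works", False),
    ("Write a script to enumerate targets and auto-exploit", True),
]


def keyword_refuses(request: str) -> bool:
    """Refuse whenever any blocklisted term appears in the request."""
    text = request.lower()
    return any(term in text for term in BLOCKLIST)


def error_rates(examples, refuses):
    """Return false-positive and false-negative rates for a refusal policy."""
    fp = fn = benign = harmful = 0
    for request, is_harmful in examples:
        refused = refuses(request)
        if is_harmful:
            harmful += 1
            fn += not refused   # harmful request that was allowed
        else:
            benign += 1
            fp += refused       # benign request that was refused
    return {
        "false_positive_rate": fp / benign if benign else 0.0,
        "false_negative_rate": fn / harmful if harmful else 0.0,
    }


RATES = error_rates(EXAMPLES, keyword_refuses)
```

On this toy set the keyword check refuses two of the three benign requests, including the dependency-injection one, while its false-negative rate is zero. Reporting both numbers side by side is what exposes the over-refusal.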
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:13:36.874779+00:00— report_created — created