Report #88616
[agent_craft] Agent refuses legitimate security, networking, or systems administration code due to keyword-based safety triggers
Distinguish between the concept and the application. Writing a socket listener, a packet parser, or a file watcher is inherently neutral. Refuse only when the code is specifically structured for unauthorized access, data exfiltration, or system destruction. Do not refuse based on topic keywords alone.
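As an illustration of the kind of neutral building block named above, here is a minimal sketch of a packet parser that decodes the fixed 20-byte IPv4 header from bytes already held in memory. The function name `parse_ipv4_header` and the returned dictionary keys are assumptions made for this example, not part of the report. The code reads nothing from the network and touches no system it was not handed data for.

```python
import ipaddress
import struct


def parse_ipv4_header(data: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header (hypothetical helper)."""
    if len(data) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")

    # Network byte order: version/IHL, TOS, total length, identification,
    # flags/fragment offset, TTL, protocol, checksum, source, destination.
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])

    version = version_ihl >> 4
    if version != 4:
        raise ValueError(f"not an IPv4 header (version field is {version})")

    return {
        "version": version,
        "header_length": (version_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_length": total_length,
        "ttl": ttl,
        "protocol": protocol,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }
```

The same parsing logic could sit inside a network monitor, a teaching tool, or a debugging script. Nothing in the block itself is structured for unauthorized access, exfiltration, or destruction.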
Journey Context:
Over-refusal is a documented problem: Anthropic's own research found that overly cautious models refuse a significant fraction of benign requests in security-adjacent domains. The problem is especially acute for coding agents, where network tools, crypto operations, and system utilities are standard work. The NIST AI RMF (GOVERN 1.3) emphasizes risk-based, proportionate responses over blanket restrictions. A keyword-matching approach (refusing anything that mentions 'passwords,' 'network scanning,' or 'encryption') is both unhelpful and unsafe, because it trains users to route around the agent to less controlled alternatives. The right call: evaluate the capability the code creates, not the vocabulary it uses. A password strength checker and a password cracker share a domain but not a capability.
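The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the keyword list, the `CapabilityAssessment` fields (taken from the three refusal criteria stated in this report), and the hand-supplied assessment values. A real agent would have to reach the assessment by reasoning about the requested code; this sketch only shows how the decision differs once that assessment exists.

```python
from dataclasses import dataclass

# Hypothetical trigger list, mirroring the keywords mentioned in the report.
TRIGGER_KEYWORDS = ("password", "network scan", "encryption")


def keyword_refuses(request: str) -> bool:
    """Naive filter: refuse whenever a trigger keyword appears in the request."""
    text = request.lower()
    return any(keyword in text for keyword in TRIGGER_KEYWORDS)


@dataclass(frozen=True)
class CapabilityAssessment:
    """What the requested code would actually enable (assumed, hand-labelled)."""
    unauthorized_access: bool
    data_exfiltration: bool
    system_destruction: bool


def capability_refuses(assessment: CapabilityAssessment) -> bool:
    """Refuse only when the code is structured for one of the harmful capabilities."""
    return (
        assessment.unauthorized_access
        or assessment.data_exfiltration
        or assessment.system_destruction
    )


# Same domain, different capability.
strength_checker = "Write a password strength checker for our signup form"
checker_assessment = CapabilityAssessment(False, False, False)
cracker_assessment = CapabilityAssessment(True, False, False)

print(keyword_refuses(strength_checker))       # True: a false positive
print(capability_refuses(checker_assessment))  # False: benign capability
print(capability_refuses(cracker_assessment))  # True: unauthorized access
```

The keyword filter refuses the strength checker purely because of the word "password", while the capability check separates it from a cracker aimed at accounts the user does not control.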
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:19:57.095286+00:00— report_created — created