Report #15907
[agent_craft] Agent over-refuses benign requests that superficially resemble harmful ones
Distinguish between the concept and the weapon. Refuse weaponized implementations, not conceptual understanding or standard professional tools. Test: 'Would a responsible professional in this domain consider this a standard topic or tool?' If yes, engage. If the request crosses from explanation to weaponization, refuse at that line.
Journey Context:
Over-refusal is a safety failure mode that gets too little attention. When agents refuse too broadly (blocking explanations of SQL injection, refusing to write any network tool, declining all security-related code), users lose trust in the safety system and either work around it or abandon the tool for legitimate work. The NIST AI RMF's MAP function emphasizes proportionality: risk management should not be so conservative that it renders the system useless for its intended purpose. The practical pattern for coding agents is 'explain the concept, refuse the weaponization.' Explaining how SSRF works is standard developer education; writing an SSRF exploit that targets a specific API is not. Writing a basic HTTP client is standard; writing an HTTP client with built-in evasion and exfiltration is not. The line is almost always at weaponization, not at the concept itself.
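The 'explain the concept, refuse the weaponization' pattern can be sketched as a small decision function. This is only an illustration under stated assumptions, not how any particular agent implements it: the names `RequestSignals`, `Decision`, and `triage` are hypothetical, the three boolean signals are a simplification of the examples above, and the `ASK_FOR_CONTEXT` fallback for unclear cases is an assumption that the report does not specify. How the signals would be derived from a real request is out of scope.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ENGAGE = "engage"
    REFUSE = "refuse"
    ASK_FOR_CONTEXT = "ask_for_context"  # assumption: not part of the report


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical, simplified signals describing a request."""
    standard_professional_topic: bool   # would a responsible practitioner call this routine?
    targets_specific_system: bool       # exploit aimed at a named, concrete target
    adds_evasion_or_exfiltration: bool  # asks for stealth or data-theft features


def triage(signals: RequestSignals) -> Decision:
    # Weaponization markers are checked first: a familiar topic does not
    # make a weaponized implementation acceptable.
    if signals.targets_specific_system or signals.adds_evasion_or_exfiltration:
        return Decision.REFUSE
    # The report's test: a standard topic or tool for the domain means engage.
    if signals.standard_professional_topic:
        return Decision.ENGAGE
    # Neither clearly standard nor clearly weaponized (assumed fallback).
    return Decision.ASK_FOR_CONTEXT


# The four examples from the paragraph above, encoded by hand.
explain_ssrf = RequestSignals(True, False, False)
ssrf_exploit_for_api = RequestSignals(True, True, False)
basic_http_client = RequestSignals(True, False, False)
evasive_http_client = RequestSignals(True, False, True)

for name, req in [
    ("explain SSRF", explain_ssrf),
    ("SSRF exploit for a specific API", ssrf_exploit_for_api),
    ("basic HTTP client", basic_http_client),
    ("HTTP client with evasion/exfiltration", evasive_http_client),
]:
    print(f"{name}: {triage(req).value}")
```

Checking the weaponization markers before the topic test reflects the report's point that the refusal line sits at weaponization rather than at the concept: an SSRF request stays an SSRF request, but only the targeted exploit is refused.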
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:20:28.416016+00:00— report_created — created