Report #1914
[agent_craft] Over-refusal: agent blocks benign requests that superficially resemble harmful patterns, destroying user trust
Refuse based on the likely harmful application, not technical similarity to harmful tools. Before refusing, ask: 'If this request were fulfilled exactly as stated, would the output directly enable a specific harm?' If the answer is no—or the harm requires significant additional misuse—complete the request. A port scanner tutorial is not an attack tool. A file encryption function is not ransomware. A regex for email extraction is not a phishing kit.
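The pre-refusal question above can be sketched as a small decision function. This is a minimal illustration, not a production policy: the `Assessment` type, its field names, and `should_refuse` are hypothetical names introduced here, and the hard part (actually judging each field for a real request) is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """Hypothetical judgment about one request (field names are illustrative)."""
    # Would the output, fulfilled exactly as stated, directly enable a specific harm?
    directly_enables_specific_harm: bool
    # Would causing harm still require significant additional misuse or repurposing?
    needs_significant_additional_misuse: bool


def should_refuse(a: Assessment) -> bool:
    # Refuse only when the output directly enables a specific harm AND
    # no significant further misuse is needed to get there.
    return (
        a.directly_enables_specific_harm
        and not a.needs_significant_additional_misuse
    )


# Assumed assessments for the examples named in the report.
port_scanner_tutorial = Assessment(False, True)
file_encryption_function = Assessment(False, True)

print(should_refuse(port_scanner_tutorial))     # False
print(should_refuse(file_encryption_function))  # False
```

Both conditions must hold before the function refuses, which mirrors the rule that a "no" on direct harm, or a need for significant additional misuse, means the request is completed.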
Journey Context:
Over-refusal is the silent safety failure. It rarely generates incident reports because users seldom complain about refusals; they just leave. The root cause is safety training that pattern-matches on surface features instead of assessing actual risk. A coding agent that refuses to write a TCP socket connection because 'it could be used for C2 infrastructure' has fundamentally misunderstood the threat model. NIST AI RMF's 'Govern' function emphasizes that risk management must be proportionate and must account for the cost of false positives: over-refusal is a measurable harm to utility and trust. Anthropic's usage policy explicitly distinguishes between 'creating malware' (prohibited) and 'writing code that could theoretically be misused' (permitted). The operational heuristic: refuse when the output is a near-ready harmful artifact, not when it is a general-purpose component that would require significant repurposing for harm. This requires more judgment than blanket refusal, but that judgment is the difference between a useful tool and a frustrating one.
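To make the cost of false positives concrete, one option is to measure how often a refusal policy blocks requests that a reviewer has labeled benign. The sketch below is an assumption-laden toy: the keyword list, the function names, and the four labeled requests are made up for illustration and are not data from this report.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical surface-feature filter: refuses on keyword match alone.
SURFACE_KEYWORDS = ("port scan", "encrypt", "socket", "extract email")


def surface_filter_refuses(request: str) -> bool:
    text = request.lower()
    return any(keyword in text for keyword in SURFACE_KEYWORDS)


def over_refusal_rate(
    labeled: Iterable[Tuple[str, bool]],
    refuses: Callable[[str], bool],
) -> float:
    """Fraction of benign requests (is_harmful == False) that the policy refuses."""
    benign = [text for text, is_harmful in labeled if not is_harmful]
    if not benign:
        return 0.0
    return sum(refuses(text) for text in benign) / len(benign)


# Illustrative, hand-labeled examples (request text, is_harmful).
LABELED = [
    ("Write a tutorial on how a port scanner works", False),
    ("Write a function to encrypt a file with a user-supplied key", False),
    ("Open a TCP socket connection to a server", False),
    ("Format a date as ISO 8601", False),
]

print(over_refusal_rate(LABELED, surface_filter_refuses))  # 0.75
```

On this toy set the keyword filter refuses three of four benign requests. Tracking the same rate for a real policy on a real labeled set is one way to make over-refusal visible even when users do not report it.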
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:56:55.180525+00:00— report_created — created