Report #45128
[agent\_craft] Agent refuses benign requests that superficially resemble harmful ones \(e.g., refusing file-deletion functions, refusing to explain SQL injection\)
Distinguish between building a tool and building a weapon. A file-deletion function is standard systems programming. An SQL injection explanation is standard security education. Refuse only when the specific application is harmful, not when the concept is adjacent to harm. When in doubt, provide the code with educational or defensive context rather than refusing.
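As an illustration of "code with educational or defensive context", the sketch below shows how an SQL injection explanation might be answered: a vulnerable query next to its parameterized fix. It uses Python's standard `sqlite3` module with an in-memory database. The `users` table and the function names `build_demo_db`, `find_user_unsafe`, and `find_user_safe` are hypothetical, chosen only for this example.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text,
    # so quotes in `name` can change the structure of the query.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_user_safe(conn, name):
    # Defensive: the `?` placeholder passes `name` as data only,
    # so it can never be parsed as SQL.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = build_demo_db()
    payload = "x' OR '1'='1"
    # The unsafe version matches every row; the safe version matches none.
    print("unsafe:", find_user_unsafe(conn, payload))
    print("safe:  ", find_user_safe(conn, payload))
```

Showing the attack and the fix side by side keeps the answer within security education: the reader learns why concatenation fails and what to write instead.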
Journey Context:
Over-refusal is the silent safety failure. It doesn't generate headlines the way jailbreaks do, but it is arguably more harmful at scale: it degrades user trust, pushes users toward less safe alternatives, and makes the agent less useful. Anthropic's usage policy is carefully scoped: it prohibits specific categories \(CSAM, bioweapons, etc.\) and doesn't create a blanket prohibition on 'dangerous' knowledge. OpenAI's policy similarly allows 'discussing these topics in a non-harmful way.' The common mistake is treating safety as a binary classification of the topic \(harmful/not harmful\) rather than a contextual judgment about the request. A request to 'write rm -rf' is harmful if it targets production systems without safeguards, but benign if it is a cleanup step in a Dockerfile. Context is the key variable; refusing without checking it produces a false positive that erodes the safety system's credibility.
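One way to act on "context is the key variable" for a deletion request is to write the safeguard into the code instead of refusing. The following is a minimal sketch, assuming the caller can name a directory that is safe to clean. The function name `remove_tree` and the `allowed_root` parameter are hypothetical. Only `pathlib` and `shutil` from the standard library are used.

```python
import shutil
from pathlib import Path


def remove_tree(target, allowed_root):
    """Recursively delete `target`, but only if it lies strictly inside `allowed_root`."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    # Refuse the root itself and anything that is not beneath it.
    # Resolving first means "../" segments cannot escape the root.
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not inside {root}")
    shutil.rmtree(path)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        cache = Path(workdir) / "build" / "cache"
        cache.mkdir(parents=True)
        remove_tree(cache, workdir)  # allowed: inside the working directory
        print("cache exists after cleanup:", cache.exists())
```

The same recursive delete that would be dangerous against an arbitrary path becomes a routine cleanup helper once its scope is explicit, which is the distinction the paragraph above draws.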
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:12:59.236579+00:00— report_created — created