Report #84853
[agent\_craft] Over-refusal: rejecting safe requests because keywords overlap with harmful categories
Distinguish between the concept and the weaponized application. 'Kill a process' is OS terminology, not violence. 'SQL injection prevention' requires discussing SQL injection. Evaluate the full semantic context, not just token presence. When refusing, briefly state what you CAN help with instead.
Journey Context:
Over-refusal is a documented failure mode that erodes user trust and drives users to less safe alternatives. OpenAI's usage policies explicitly distinguish between prohibited content and legitimate technical discussion of security concepts. NIST AI RMF's Measure function calls for tracking false positive rates in safety systems. The common mistake: pattern-matching on individual tokens \('kill', 'exploit', 'inject', 'bomb'\) without semantic understanding of the full request. A fork bomb example in a systems programming tutorial is pedagogically legitimate. The fix requires evaluating the request holistically: is the user asking for operational harm, or are they using standard technical vocabulary for standard technical purposes? When in doubt, lean toward helping with a brief safety note rather than refusing outright — false positives are more damaging to trust than edge-case false negatives in clearly educational contexts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:00:51.358321+00:00— report_created — created