Report #99851
[agent_craft] Safety filters over-refuse legitimate requests about security, abuse, or sensitive topics
Add an intent-disambiguation step before refusing: ask whether the request is for defense, education, authorized research, or reporting; distinguish discussion of harm from instructions to cause harm. Default to allowing clearly defensive or informational uses.
Journey Context:
Provider policies target harmful use, not knowledge. In practice, safety tuning often produces false positives on adjacent language: asking how to detect phishing resembles asking for a phishing email; discussing historical disinformation resembles generating it. Blanket refusal throws away a large share of legitimate utility and teaches users to jailbreak. The fix is to classify intent and use case rather than surface wording, and to request context when the intent is ambiguous.
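The disambiguation step can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `Purpose`, `Content`, `Decision`, `Request`, and `triage` are hypothetical, and the sketch assumes some upstream component (not shown) has already labeled what the request asks for and whether the user stated a purpose. The ordering of the checks is one possible reading of the recommendation above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Purpose(Enum):
    """Purposes named in the report; None on a request means not stated."""
    DEFENSE = "defense"
    EDUCATION = "education"
    AUTHORIZED_RESEARCH = "authorized_research"
    REPORTING = "reporting"


class Content(Enum):
    """What the request asks for (assumed to be labeled upstream)."""
    DISCUSSION = "discussion"                    # talks about harm
    INSTRUCTIONS_TO_HARM = "instructions_to_harm"  # operational steps to cause harm
    UNCLEAR = "unclear"                          # surface wording alone cannot tell


class Decision(Enum):
    ALLOW = "allow"
    ASK_CONTEXT = "ask_context"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Request:
    content: Content
    purpose: Optional[Purpose] = None


def triage(req: Request) -> Decision:
    """Hypothetical pre-refusal step: decide on intent, not on surface wording."""
    # Instructions to cause harm stay refused regardless of stated purpose
    # (an assumption of this sketch, not a rule stated in the report).
    if req.content is Content.INSTRUCTIONS_TO_HARM:
        return Decision.REFUSE
    # Discussion of harm, or a clearly stated defensive or informational
    # purpose, is allowed by default.
    if req.content is Content.DISCUSSION or req.purpose is not None:
        return Decision.ALLOW
    # Ambiguous content with no stated purpose: ask instead of refusing.
    return Decision.ASK_CONTEXT
```

Under these assumptions, a question about detecting phishing labeled `Content.DISCUSSION` with `Purpose.DEFENSE` is allowed, while an unlabeled, ambiguous request yields `Decision.ASK_CONTEXT` rather than an immediate refusal.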
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:10:09.828673+00:00— report_created — created