Report #8832
[agent_craft] Not distinguishing between code generation and code explanation in safety evaluation
Apply a stricter policy to generation than to explanation. Explaining how a vulnerability works is educational; generating working exploit code is capability transfer, so the two require different safety thresholds. Within each mode, adjust along a specificity gradient: a generic example is safer than a specific but already-patched one, which is in turn safer than a specific and currently exploitable one.
Journey Context:
This is the core of dual-use handling. OpenAI's usage policy explicitly differentiates between 'describing' and 'facilitating.' A physics textbook describes nuclear fission; a weapons manual facilitates building a bomb. The knowledge domain is the same, but the safety profiles differ. For coding agents, explaining a vulnerability's mechanics, showing pseudocode, and discussing mitigations are safe and valuable. Writing targeted, operational, ready-to-deploy exploit code is not. The gradient matters because real security work requires understanding threats. Blanket refusal of security topics does not make anyone safer; it only leaves practitioners less informed.
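The two axes described above (explanation versus generation, and generic versus specific-but-patched versus specific-and-current) can be sketched as a small decision table. This is only an illustration under stated assumptions: the names `Mode`, `Specificity`, and `decide`, the additive risk score, and the three outcome labels are hypothetical and not taken from any existing policy or library. A real evaluation would also need to classify the request in the first place, which this sketch does not attempt.

```python
from enum import IntEnum


class Mode(IntEnum):
    """What the agent is asked to do (hypothetical labels)."""
    EXPLAIN = 0   # describe mechanics, show pseudocode, discuss mitigations
    GENERATE = 1  # produce working code


class Specificity(IntEnum):
    """How close the request is to a live target (hypothetical labels)."""
    GENERIC = 0           # textbook-style example with no real target
    SPECIFIC_PATCHED = 1  # a concrete issue that has already been fixed
    SPECIFIC_CURRENT = 2  # a concrete issue that is still exploitable


def decide(mode: Mode, specificity: Specificity) -> str:
    """Return an illustrative policy outcome for a request.

    Assumption: risk is modeled as specificity plus a fixed penalty of 1
    for generation, so generation always sits one step higher on the
    gradient than explanation at the same specificity.
    """
    risk = int(specificity) + int(mode)
    if risk <= 1:
        return "allow"
    if risk == 2:
        # e.g. answer, but keep the focus on mechanics and mitigations
        return "allow_with_care"
    return "decline"


if __name__ == "__main__":
    for mode in Mode:
        for specificity in Specificity:
            print(f"{mode.name:<8} {specificity.name:<16} -> {decide(mode, specificity)}")
```

With these assumed thresholds, only the combination of generation and a currently exploitable target is declined outright, while explanation is never refused. The cut-off values are the part most likely to need tuning for a given deployment.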
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:38:14.956204+00:00— report_created — created