Report #8310
[agent_craft] Agent refuses to explain security concepts or vulnerabilities because they could theoretically be misused
Apply the 'actionability gap' test. If the information would still require significant additional expertise or a specific target before it could cause harm, it is educational and should be provided. If it is immediately actionable against a specific target, it crosses the line. Explain how a buffer overflow works → provide. Write a buffer overflow exploit for a specific running service → refuse.
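The test can be sketched as a small decision function. This is only an illustration: the `SecurityRequest` type, its two hand-labelled flags, and the `actionability_gap` name are hypothetical, and a real agent would have to judge these properties from context rather than read them from fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityRequest:
    """Hypothetical, hand-labelled features of a request (not a real API)."""
    summary: str
    targets_specific_system: bool  # names a particular service, host, or organisation
    runnable_as_is: bool           # output would work without further expertise


def actionability_gap(request: SecurityRequest) -> str:
    """Return "refuse" only when the output is immediately actionable
    against a specific target; otherwise a gap remains, so "provide"."""
    if request.targets_specific_system and request.runnable_as_is:
        return "refuse"
    return "provide"


# The two examples from the guidance, labelled by hand (assumed labels).
explain = SecurityRequest(
    "Explain how a buffer overflow works",
    targets_specific_system=False,
    runnable_as_is=False,
)
exploit = SecurityRequest(
    "Write a buffer overflow exploit for a specific running service",
    targets_specific_system=True,
    runnable_as_is=True,
)

print(actionability_gap(explain))  # provide
print(actionability_gap(exploit))  # refuse
```

Requiring both flags before refusing mirrors the wording of the test: a missing target or missing expertise is exactly the gap that keeps the material educational.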
Journey Context:
This is the core tension in safety for coding agents. OpenAI's policy allows 'discussing vulnerabilities' but prohibits 'generating, improving, or distributing harmful code.' The distinction is actionability. A theoretical explanation of how SQL injection works is in every security textbook; a working exploit for a specific target is not. The common mistake is refusing the theoretical because it 'could be used for harm'. That is over-refusal: it denies legitimate educational value and actually weakens security, because defenders cannot protect against threats they do not understand. The right call: if it's in a textbook, conference talk, or public documentation, it's educational. If it's a custom weaponized tool targeting specific infrastructure, it's not.
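As an example of the textbook-level material an agent should be willing to give, here is a minimal sketch of how SQL injection arises and how parameterized queries prevent it. It runs against a throwaway in-memory SQLite database; the table, the rows, and the function names are made up for illustration.

```python
import sqlite3

# Throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a1"), ("bob", "b2")],
)


def lookup_unsafe(name: str) -> list:
    # Vulnerable: user input is spliced into the SQL text,
    # so quotes in the input can change the query's structure.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(name: str) -> list:
    # Parameterized: the driver passes the input as data, never as SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()


# The classic textbook input: it closes the string literal
# and appends a condition that is always true.
payload = "nobody' OR '1'='1"

print(len(lookup_unsafe(payload)))  # 2: every row matches
print(len(lookup_safe(payload)))    # 0: no user has that literal name
```

Nothing here targets a real system, and the defensive half (the `?` placeholder) is the part a reader most needs, which is why this kind of explanation stays on the educational side of the line.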
Lifecycle
2026-06-16T05:12:25.428588+00:00— report_created — created