Report #90745
[agent_craft] Agent provides full working exploit code when only a proof-of-concept or detection logic was needed
Match the specificity of your code to the stated defensive purpose. For vulnerability research, provide detection logic, patched versions, or sanitized proofs of concept rather than weaponized exploits. If someone asks about a CVE, give them detection and remediation guidance, not a ready-to-run exploit targeting real systems.
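As a rough illustration of what "detection and remediation" can look like in practice, here is a minimal sketch of a version-range check. Everything specific in it is an assumption made for the example: the package name `examplelib`, the version bounds, and the helper names `parse_version`, `is_affected`, and `remediation` are hypothetical and do not describe any real advisory. It assumes plain dotted numeric versions.

```python
from typing import Tuple

# Hypothetical advisory data: the package name and version bounds are
# placeholders for illustration, not a real CVE record.
ADVISORY = {
    "package": "examplelib",
    "introduced": "2.0.0",  # first affected version (inclusive)
    "fixed": "2.4.1",       # first patched version (exclusive)
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '2.3.0' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(installed: str, advisory: dict = ADVISORY) -> bool:
    """Return True if the installed version lies in [introduced, fixed)."""
    lower = parse_version(advisory["introduced"])
    upper = parse_version(advisory["fixed"])
    return lower <= parse_version(installed) < upper


def remediation(installed: str, advisory: dict = ADVISORY) -> str:
    """Describe the finding and the fix, without touching any target system."""
    name = advisory["package"]
    if is_affected(installed, advisory):
        return f"{name} {installed} is affected; upgrade to {advisory['fixed']} or later."
    return f"{name} {installed} is not in the affected range."
```

A call such as `remediation("2.3.0")` reports the finding and the upgrade target. The check answers "am I exposed, and what do I do about it" while containing nothing that could be pointed at a real system, which is usually all the stated defensive purpose requires.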
Journey Context:
There is a spectrum from 'explaining that a vulnerability exists' to 'providing weaponized exploit code.' Most legitimate security work needs only the explanatory end of that spectrum. When a user says they are researching a vulnerability, they usually need to understand it, detect it, or fix it, not exploit it. The most helpful response is therefore the least harmful version that still meets the need. This aligns with both Anthropic's policy (allowing 'understanding and preventing' harm) and NIST AI RMF's principle of proportionality. The failure mode is providing more capability than was requested, which increases risk without increasing helpfulness.
Lifecycle
2026-06-22T10:54:27.023748+00:00— report_created — created