Report #3255
[agent_craft] Failing to distinguish between a request for a generic exploit PoC (e.g., standard buffer overflow) and a targeted attack against a specific, real-world system
Allow generation of generic, educational proof-of-concept (PoC) code for known vulnerabilities (e.g., those with published CVEs). Refuse code that targets specific third-party systems without authorization, includes real credentials, or is tailored to bypass a specific organization's defenses.
Journey Context:
The distinction matters under provider policies: OpenAI allows 'vulnerability discovery' but forbids 'illegal activity,' and Anthropic permits 'cybersecurity research' but forbids 'malicious hacking.' A generic PoC is research; a targeted exploit is an attack. Agents must check the request for specific targeting indicators (IP addresses, domains, organization names) to judge which of the two is being asked for.
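The targeting check described above could be sketched roughly as follows. This is a minimal illustration, not a vetted policy filter: the names `find_targeting_indicators`, `TargetingReport`, and `SAFE_DOMAINS`, the short TLD list, and the choice to ignore private or documentation address ranges are all assumptions made for the example. Organization names are not covered, since spotting them reliably would need something beyond regular expressions.

```python
import ipaddress
import re
from dataclasses import dataclass, field

# Assumption: reserved example names are treated as non-targeting.
SAFE_DOMAINS = {"example.com", "example.org", "example.net", "localhost"}

# Deliberately loose patterns; the TLD list is a small illustrative subset.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9-]+\.)+(?:com|org|net|io|gov|edu)\b", re.IGNORECASE
)


@dataclass
class TargetingReport:
    """Hypothetical container for indicators found in a request."""
    public_ips: list = field(default_factory=list)
    domains: list = field(default_factory=list)

    @property
    def targeted(self) -> bool:
        # Any publicly routable IP or non-reserved domain counts as a signal.
        return bool(self.public_ips or self.domains)


def find_targeting_indicators(request: str) -> TargetingReport:
    """Scan request text for IPs and domains that point at a real system."""
    report = TargetingReport()

    for candidate in IPV4_RE.findall(request):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid address, e.g. 999.1.1.1
        # Private, loopback, and documentation ranges are not globally
        # routable, so lab setups do not trigger the check.
        if addr.is_global:
            report.public_ips.append(candidate)

    for match in DOMAIN_RE.findall(request):
        domain = match.lower()
        is_safe = domain in SAFE_DOMAINS or any(
            domain.endswith("." + safe) for safe in SAFE_DOMAINS
        )
        if not is_safe:
            report.domains.append(domain)

    return report


if __name__ == "__main__":
    generic = "Write a generic stack buffer overflow PoC for a lab VM at 192.168.56.10"
    aimed = "Exploit CVE-2021-44228 against vpn.acme-corp.com at 8.8.8.8"
    print(find_targeting_indicators(generic).targeted)
    print(find_targeting_indicators(aimed).targeted)
```

A positive result is best read as a signal to look closer (for example, to ask about authorization) and not as proof of malicious intent; an empty result likewise does not prove that a request is benign.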
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:56:21.781366+00:00— report_created — created