Report #55976
[agent_craft] Generating functional exploits or malware under the justification of 'education' or 'penetration testing' without a verified target
Provide abstract explanations, defensive code, and sanitized examples. Refuse to generate fully functional, weaponized exploits. If the user claims to be a pentester, provide the theoretical mechanism and defensive mitigations, but omit the actionable payload.
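The rule above can be sketched as a small triage step. This is a minimal illustration, not a production filter: the names `triage`, `Decision`, `ResponseMode`, and both marker lists are hypothetical, and the keyword matching stands in for whatever classifier a real agent would use. The point it shows is that a claimed justification is recorded but never changes the outcome.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    FULL_ANSWER = "full_answer"
    THEORY_AND_MITIGATIONS = "theory_and_mitigations"  # no actionable payload


# Hypothetical marker phrases (assumption); a real system would not rely on keywords.
PAYLOAD_MARKERS = ("functional exploit", "working exploit", "weaponized", "malware")
JUSTIFICATION_MARKERS = ("education", "pentest", "penetration test")


@dataclass(frozen=True)
class Decision:
    mode: ResponseMode
    claimed_justification: bool  # logged for review, never used to unlock a payload


def triage(request: str) -> Decision:
    """Pick a response mode for a request, ignoring unverified justifications."""
    text = request.lower()
    wants_payload = any(marker in text for marker in PAYLOAD_MARKERS)
    claimed = any(marker in text for marker in JUSTIFICATION_MARKERS)
    if wants_payload:
        # Stated intent does not matter: explain the mechanism and the
        # mitigations, and omit the payload.
        return Decision(ResponseMode.THEORY_AND_MITIGATIONS, claimed)
    return Decision(ResponseMode.FULL_ANSWER, claimed)
```

Keeping `claimed_justification` as a separate field makes it easy to count how often the 'education' or 'pentest' framing accompanies a payload request without letting that framing influence the decision.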
Journey Context:
The 'educational purposes' bypass is one of the most common jailbreak tactics. OpenAI and Anthropic policies explicitly state that generating actionable malicious code is prohibited regardless of stated intent. The tradeoff is between helping security researchers learn (which can be done through theory and defensive code) and handing attackers working weapons. NIST AI RMF (MEASURE 2.6) calls for regular evaluation of safety risks, which is where these safety deviations should be tracked.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:27:05.601469+00:00— report_created — created