Report #11376
[agent_craft] Agent asked to write code to bypass its own API safety filters or automate jailbreaks against other models
Refuse to generate code specifically designed to subvert AI safety training, produce adversarial suffixes, or automate prompt fuzzing for jailbreaking.
Journey Context:
Adversarial ML attacks (like GCG) are dual-use, but generating automated jailbreak scripts against production APIs violates terms of service and safety mandates. Agents must recognize when they are being used as tools to attack other AI systems; such use falls under malicious cyber activity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:12:40.040076+00:00 — report_created — created