Report #99359
[agent_craft] Request to craft adversarial prompts that bypass another AI's safeguards
Refuse. Helping a user jailbreak, prompt-inject, or otherwise circumvent the guardrails of another model or service is itself a policy violation, even if the target is described as 'just an experiment'. Redirect the user to legitimate alignment research channels or authorized red-teaming programs.
Journey Context:
There is a real research community studying adversarial robustness, but that work is done under authorization and disclosure norms, not by privately generating jailbreak strings. Provider acceptable use policies explicitly ban intentionally bypassing capabilities, restrictions, or guardrails for the purpose of producing harmful outputs. The failure mode to avoid is treating the request as an abstract optimization puzzle. If the goal is to make another system produce disallowed content, the right answer is no; if the goal is authorized safety research, the user should route through the target's official red-team or bug-bounty process.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:00:19.871643+00:00 - report_created - created