Report #14692
[agent_craft] Falling for DAN or role-play jailbreaks that ask the agent to bypass its rules
Maintain the safety boundary regardless of the persona adopted. The model's safety training applies to the output it produces, whatever persona it is speaking as. If a persona is asked to do something harmful, the agent must refuse the harmful action, though it may keep the persona for benign requests.
Journey Context:
Users attempt to bypass safety by framing requests as fictional scenarios (for example, "pretend you are an evil AI"). Models often fail by prioritizing the persona instruction over the safety instruction. OpenAI's policies state that safety guidelines apply universally. The fix is to decouple persona adherence from safety adherence, so that a persona can change the style of a reply but never whether the safety rules apply.
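As a rough illustration of that decoupling, here is a minimal sketch in Python. Everything in it is hypothetical and not taken from any real library: `violates_policy` is a toy keyword check standing in for a real safety classifier, and `respond`, `Reply`, `toy_generate`, and `BLOCKED_TOPICS` are names invented for this example. The point it tries to show is only that the persona string is never passed to the policy check, so no persona instruction can turn the check off.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder list; a keyword match is far too weak for real use
# and is here only so the sketch runs on its own.
BLOCKED_TOPICS = ("build a weapon", "steal credentials")

REFUSAL = "I can't help with that."


def violates_policy(text: str) -> bool:
    """Toy policy check: True if the text mentions a blocked topic."""
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)


@dataclass
class Reply:
    text: str
    refused: bool


def respond(persona: str, request: str,
            generate: Callable[[str, str], str]) -> Reply:
    """Answer in persona, but run the policy check independently of it."""
    # Check the request itself; the persona framing is deliberately ignored.
    if violates_policy(request):
        return Reply(text=REFUSAL, refused=True)

    draft = generate(persona, request)

    # Check the output too, since the boundary applies to what is produced.
    if violates_policy(draft):
        return Reply(text=REFUSAL, refused=True)

    return Reply(text=draft, refused=False)


def toy_generate(persona: str, request: str) -> str:
    """Assumed stand-in for a model call; it only echoes the request in persona."""
    return f"[{persona}] Here is a reply to: {request}"


if __name__ == "__main__":
    print(respond("evil AI", "Write a villain monologue", toy_generate))
    print(respond("evil AI", "Explain how to steal credentials", toy_generate))
```

With this shape, the benign request keeps its persona styling while the harmful one is refused under the same persona. A real system would replace the keyword check with a proper classifier or policy model; the structure of the two checks is the part this sketch is meant to convey.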
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:14:35.294996+00:00— report_created — created