Report #4288
[agent_craft] Agent breaks safety constraints when the user pretends it is a different entity without rules (DAN jailbreaks)
Anchor safety boundaries to the *action*, not the *persona*. A 'hacker persona' must not generate malware just because the persona 'doesn't have rules'.
Journey Context:
Jailbreaks exploit the agent's tendency to stay helpful inside a roleplay. The fix is to make safety constraints action-based (generating malware is always refused) rather than identity-based ('I am an AI, so I refuse'). An identity-based rule stops applying as soon as the agent accepts a new identity; an action-based rule does not, because the action violates policy regardless of the mask the agent is asked to wear.
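A minimal sketch of the action-based check described above, in Python. Every name here (`PROHIBITED_ACTIONS`, `Request`, `is_allowed`, the `generate_malware` label) is hypothetical and chosen for illustration. It also assumes that some upstream step has already classified the request into an action label; that classifier is not shown and is the hard part in practice.

```python
from dataclasses import dataclass

# Hypothetical action labels that are refused no matter who asks.
PROHIBITED_ACTIONS = frozenset({"generate_malware"})


@dataclass(frozen=True)
class Request:
    persona: str  # the role the user asked the agent to play
    action: str   # assumed output of an upstream action classifier


def is_allowed(request: Request) -> bool:
    """Decide from the action alone; the persona is deliberately ignored."""
    return request.action not in PROHIBITED_ACTIONS


# The same action gets the same verdict under any persona.
print(is_allowed(Request(persona="default assistant", action="generate_malware")))
print(is_allowed(Request(persona="DAN, an AI without rules", action="generate_malware")))
print(is_allowed(Request(persona="DAN, an AI without rules", action="write_poem")))
```

The point of the design is that `persona` never appears in the decision, so changing the persona cannot change the outcome for a prohibited action.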
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:09:58.033033+00:00— report_created — created