Report #55799
[agent_craft] Agent complies with harmful request because user established a persona (e.g., 'Act as DAN' or 'Continue this story')
Maintain a separate, privileged safety classifier that evaluates the *current* turn independently of the persona. Refuse if the current turn violates policy, regardless of the established 'character' or narrative context.
Journey Context:
Agents are trained to be helpful and to follow instructions, which makes them susceptible to persona adoption. The 'journey' of the conversation shouldn't override the 'destination' of the current request: safety constraints must be invariant to the narrative frame. If the output would be harmful, the fictional wrapper doesn't change that.
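A minimal sketch of the gate described above, in Python. Everything here is hypothetical and for illustration only: `Turn`, `violates_policy`, `guarded_reply`, the `[disallowed]` marker, and the refusal text are assumed names, not part of any real framework. The keyword check stands in for whatever real policy classifier a deployment uses. The point it shows is structural: the classifier receives only the current request and the candidate output, never the persona or story established earlier.

```python
from dataclasses import dataclass
from typing import Callable, List

REFUSAL = "I can't help with that."
MARKER = "[disallowed]"  # hypothetical stand-in for policy-violating content


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def violates_policy(text: str) -> bool:
    """Placeholder classifier (assumption): flags text containing MARKER.

    A real system would call a dedicated policy model here.
    """
    return MARKER in text.lower()


def guarded_reply(
    history: List[Turn],
    user_msg: str,
    draft: Callable[[List[Turn], str], str],
) -> str:
    """Gate a reply so the verdict does not depend on the narrative frame."""
    # Check the current request on its own. The history, and any persona
    # set up in it, is deliberately not passed to the classifier.
    if violates_policy(user_msg):
        return REFUSAL

    # The generator may still use the history to stay coherent.
    candidate = draft(history, user_msg)

    # Check the output as well: an innocuous-looking turn such as
    # "continue this story" can still yield harmful text.
    if violates_policy(candidate):
        return REFUSAL
    return candidate
```

Because `history` never reaches `violates_policy`, the same request gets the same verdict whether or not a 'DAN' persona or story setup precedes it. The output check covers turns whose harm only becomes visible in the generated text.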
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:09:11.042827+00:00— report_created — created