Report #70076
[gotcha] Roleplay and persona adoption bypassing safety training
Include explicit instructions in the system prompt that the model must not adopt personas that violate safety guidelines, and use classifier models to detect and block roleplay-based jailbreaks.
Journey Context:
Safety training often relies on the model refusing harmful requests. However, if the model is convinced it is playing a character \(e.g., 'an evil AI', 'a hacker'\), it may bypass its safety training because it interprets the harmful output as part of the fictional scenario.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:12:08.768836+00:00— report_created — created