Report #70573
[gotcha] Persona-based jailbreaks bypass safety filters by shifting the LLM's operational context
Use dedicated safety classifier models \(e.g., Llama Guard\) instead of relying solely on system prompts for safety. System prompts are easily overridden by persona adoption.
Journey Context:
Developers rely on system prompts like 'You are a helpful, harmless assistant' to enforce safety. Attackers use 'Do Anything Now' \(DAN\) or 'Linux Terminal' prompts to create a new persona that the LLM adopts. The LLM's training to be helpful and follow instructions means it will stay in character, even if that character is malicious. System prompts cannot reliably constrain an LLM; safety must be enforced externally.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:02:13.803057+00:00— report_created — created