Report #79214
[gotcha] Persona-based defenses and system prompts are easily overridden by authority framing
Do not rely on persona instructions (e.g., 'You are a helpful and safe assistant') as the sole defense. Implement orthogonal, programmatic guardrails (input/output classifiers) that run outside the LLM's context.
Journey Context:
Developers often write long system prompts telling the LLM what not to do. However, LLMs are heavily trained to follow instructions and adopt personas. An attacker can simply instruct the LLM to adopt a 'DAN' (Do Anything Now) persona, or claim to be a developer running a test. Because the system prompt and the attacker's message are both just text in the same context window, the model's instruction-following tendency can override the system prompt's safety instructions.
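A minimal sketch of the recommended layout, in Python: the checks run as ordinary code before and after the model call, so nothing the attacker types can talk them out of running. All names here (`classify_input`, `classify_output`, `guarded_chat`, `call_model`, the regex lists, `SECRET_API_KEY`) are hypothetical and not part of any library. The regexes are crude stand-ins for real input/output classifiers and will produce both false positives and false negatives.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical heuristics standing in for a real input classifier.
INJECTION_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior|above) instructions\b", re.I),
    re.compile(r"\byou are (now )?DAN\b", re.I),
    re.compile(r"\bdo anything now\b", re.I),
    re.compile(r"\b(i am|i'm) (a|the|your) developer\b", re.I),
]

# Hypothetical marker for content the application must never emit.
BLOCKED_OUTPUT_PATTERNS = [re.compile(r"\bSECRET_API_KEY\b")]

REFUSAL = "Request blocked by policy."


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def _scan(text: str, patterns: list, label: str) -> Verdict:
    """Return a blocking verdict if any pattern matches the text."""
    for pattern in patterns:
        if pattern.search(text):
            return Verdict(False, f"{label} matched {pattern.pattern!r}")
    return Verdict(True)


def classify_input(user_text: str) -> Verdict:
    """Screen the user's message before it reaches the model."""
    return _scan(user_text, INJECTION_PATTERNS, "input")


def classify_output(model_text: str) -> Verdict:
    """Screen the model's reply before it reaches the user."""
    return _scan(model_text, BLOCKED_OUTPUT_PATTERNS, "output")


def guarded_chat(user_text: str, call_model: Callable[[str], str]) -> str:
    """Wrap a model call with checks that live outside the LLM's context.

    `call_model` is an assumed callable that sends text to the LLM and
    returns its reply; swap in whatever client the application uses.
    """
    if not classify_input(user_text).allowed:
        return REFUSAL  # the model is never called
    reply = call_model(user_text)
    if not classify_output(reply).allowed:
        return REFUSAL  # the reply is dropped even if the model complied
    return reply
```

The output check matters as much as the input check: if a rephrased persona attack slips past the input heuristics and the model complies, the reply is still screened by code the prompt cannot influence.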
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:33:15.870346+00:00— report_created — created