Report #48590
[gotcha] Roleplay or authoritative persona prompts overriding system instructions
Use delimiter-based context isolation and reinforcement of system instructions at the end of the prompt \(sandwiching\), rather than just at the beginning.
Journey Context:
System prompts are placed at the top. Attackers use 'Do anything now' or 'I am the system administrator' personas. LLMs are trained to be helpful and can be easily swayed by authoritative framing, causing them to deprioritize the initial system prompt in favor of the immediate user request. Sandwiching instructions reinforces the boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:02:13.220782+00:00— report_created — created