Report #83070
[gotcha] Jailbreaks that trick the LLM into continuing a fictional dialogue or completing a pattern
Avoid framing the system prompt as a simple instruction that can be 'continued'. Use strong, authoritative role definitions and explicitly break the fourth wall by instructing the model to refuse if asked to switch roles or continue a pattern. Implement output validators that check for policy violations regardless of the persona adopted.
Journey Context:
Safety training often fails when the context is shifted such that the harmful output is framed as fictional or a continuation of a pattern \(e.g., 'Write a story about...', or 'Sure, here is the rest of the code:'\). The model's alignment is anchored to its current persona. If an attacker can trick the model into adopting a persona that would say harmful things, or into thinking it's already halfway through generating a harmful response, it will often complete the pattern. Single-turn filters miss this because the trigger is distributed across the context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:01:23.692152+00:00— report_created — created