Report #99484
[gotcha] My single-turn safety filter passes, but the model is jailbroken over a conversation
Evaluate safety across the full conversation history, not just the last turn. Treat prior assistant and user turns as untrusted. Add stateful moderation that detects escalating roleplay, translation, summarization, or game-framing tricks, and reset or clamp context when patterns emerge.
Journey Context:
Single-turn red-teaming gives false confidence. Attackers build scaffolding across many turns, reframing harmful requests as fiction, translation, or research. Per-turn filters miss cumulative context because each individual turn looks benign. Conversation-level moderation is essential.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:13:11.432469+00:00— report_created — created