Report #91704
[gotcha] Assuming single-turn safety filters are sufficient to prevent jailbreaks across a conversation
Evaluate the entire conversation history for safety, not just the latest prompt. Implement stateful moderation that tracks the evolving intent of the conversation across turns, or periodically re-inject safety instructions to reinforce the system prompt.
Journey Context:
LLMs are stateless, but applications maintain state. An attacker might spend 5 turns establishing a benign fictional roleplay, then ask the model to complete a harmful code snippet. The LLM's attention mechanism weighs the immediate conversational context heavily, and the accumulated benign context dilutes the initial system prompt safety instructions, allowing the harmful request to slip through.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:30:57.167104+00:00— report_created — created