Report #39828
[gotcha] Single-turn safety filters fail against multi-step contextual jailbreaks
Implement stateful moderation that tracks semantic intent across the entire conversation history, not just the current turn. Reject or flag conversations whose context gradually shifts toward restricted topics, even if no single turn triggers the filter.
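A minimal sketch of what such a stateful check could look like, in Python. Everything named here is hypothetical and for illustration only: the `ConversationModerator` class, both thresholds, the window size, and `toy_scorer`, which stands in for whatever real classifier or moderation model is in use. The point is only the shape: score the new turn on its own, then score the recent history as a whole.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: maps text to a risk score in [0, 1].
Scorer = Callable[[str], float]


@dataclass
class ConversationModerator:
    score_text: Scorer
    turn_threshold: float = 0.8          # assumed value: blocks a single turn
    conversation_threshold: float = 0.6  # assumed value: flags cumulative drift
    window: int = 8                      # assumed value: recent turns scored together
    history: List[str] = field(default_factory=list)

    def check(self, user_turn: str) -> str:
        """Return 'block', 'flag', or 'allow' for the new turn."""
        self.history.append(user_turn)
        # Per-turn check: what a stateless filter already does.
        if self.score_text(user_turn) >= self.turn_threshold:
            return "block"
        # Conversation-level check: score the recent turns as one text.
        recent = "\n".join(self.history[-self.window:])
        if self.score_text(recent) >= self.conversation_threshold:
            return "flag"
        return "allow"


def toy_scorer(text: str) -> float:
    # Toy stand-in for a real classifier: fraction of marker terms present.
    markers = ("bypass", "alarm", "undetected")
    lowered = text.lower()
    return sum(1 for m in markers if m in lowered) / len(markers)


moderator = ConversationModerator(score_text=toy_scorer)
first = moderator.check("How does a home alarm system work?")
second = moderator.check("What makes a sensor easy to bypass?")
```

With this toy scorer, each turn scores low on its own, but the second call returns "flag" because the combined history crosses the conversation threshold. A real deployment would replace the keyword scorer with a model that judges intent, and would need its thresholds tuned against false positives.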
Journey Context:
Developers typically test their safety filters by sending a single harmful prompt and confirming that it is blocked. Attackers instead use multi-turn attacks: they start with benign questions and slowly steer the context. Because each turn looks benign in isolation, the per-turn filter passes it. The LLM, however, retains the context and complies with the cumulative intent. Conversation-level (stateful) moderation is therefore required; it is computationally more expensive but necessary to catch gradual steering.
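The testing gap described above can be made concrete with a multi-turn regression case. The following Python sketch is illustrative only: `first_flagged_turn`, `keyword_risk`, the marker terms, and the 0.6 threshold are all assumptions, and the keyword scorer is a toy stand-in for a real moderation model. It runs the same scripted conversation through a per-turn check and a cumulative check and records where each one trips.

```python
from typing import Callable, List, Optional


def first_flagged_turn(
    turns: List[str],
    risk: Callable[[str], float],
    threshold: float,
    cumulative: bool,
) -> Optional[int]:
    """Return the 0-based index of the first turn that trips the filter, else None."""
    for i, turn in enumerate(turns):
        # Cumulative mode scores everything said so far; per-turn mode
        # scores only the current message.
        text = "\n".join(turns[: i + 1]) if cumulative else turn
        if risk(text) >= threshold:
            return i
    return None


def keyword_risk(text: str) -> float:
    # Toy stand-in for a real classifier: fraction of marker terms present.
    markers = ("lock", "without a key", "unnoticed")
    lowered = text.lower()
    return sum(m in lowered for m in markers) / len(markers)


# Scripted conversation in which each turn is mild on its own.
steering_script = [
    "What are the parts of a pin tumbler lock?",
    "Which of them matter if you open it without a key?",
    "How would that be done so it goes unnoticed?",
]

per_turn_hit = first_flagged_turn(steering_script, keyword_risk, 0.6, cumulative=False)
cumulative_hit = first_flagged_turn(steering_script, keyword_risk, 0.6, cumulative=True)
```

Here `per_turn_hit` is `None`, since no single message reaches the threshold, while `cumulative_hit` is `1`, the second turn. A test suite built only from single harmful prompts would never exercise this path, so scripted multi-turn cases like this are worth keeping alongside the single-prompt ones.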
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:19:33.400574+00:00— report_created — created