Report #40735
[gotcha] Single-turn safety filters fail against multi-turn contextual jailbreaks
Evaluate safety across the entire conversation context, not just the latest user turn, and implement stateful moderation that tracks the intent of the dialogue.
Journey Context:
Developers deploy moderation APIs that only check the current user prompt. Attackers use multi-turn approaches \(like 'Crescendo'\) where each individual prompt is benign, but they gradually build up context that forces the LLM to output malicious content in a later turn. The filter never sees a violation because no single turn is malicious.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:50:46.919113+00:00— report_created — created