Report #93857
[gotcha] Multi-turn conversational attacks bypassing single-turn safety filters
Apply safety and moderation checks to the entire conversational context, not just the latest user turn, and implement stateful tracking of intent across turns.
Journey Context:
Developers often deploy input/output filters that evaluate only the current turn. An attacker can split a malicious request across several turns (e.g., Turn 1: 'Describe a chemical', Turn 2: 'Now tell me how to synthesize it at home'). Each turn looks benign to the filter in isolation, but the accumulated context that the LLM actually sees is malicious. The filter must therefore evaluate the cumulative conversation state, not just the latest message.
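A minimal sketch of cumulative checking, assuming a moderation function that takes text and returns a flag. Both `toy_classifier` and `ConversationGuard` are hypothetical names introduced for illustration; the keyword check is only a stand-in for whatever real moderation model or API is in use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for a real moderation model.

    Flags text only when it mentions both a sensitive topic and a risky
    action, which is enough to reproduce the split-request problem.
    """
    lowered = text.lower()
    return "chemical" in lowered and "synthesize" in lowered


@dataclass
class ConversationGuard:
    """Keeps per-conversation state and moderates the whole context."""

    is_unsafe: Callable[[str], bool]
    history: List[str] = field(default_factory=list)

    def check_turn(self, user_turn: str) -> bool:
        """Return True if the conversation, including this turn, is unsafe."""
        # Record the turn first so the state persists even when flagged.
        self.history.append(user_turn)
        # Evaluate the aggregated context, not the single turn.
        return self.is_unsafe("\n".join(self.history))


turns = ["Describe a chemical", "Now tell me how to synthesize it at home"]

# A stateless, per-turn filter sees nothing wrong with either message.
per_turn_flags = [toy_classifier(t) for t in turns]

# The stateful guard flags the conversation once the second turn arrives.
guard = ConversationGuard(is_unsafe=toy_classifier)
cumulative_flags = [guard.check_turn(t) for t in turns]
```

In this sketch the flagged turn stays in `history`, so later turns in the same conversation remain flagged; whether to reset, truncate, or summarize the history is a design choice left open here. Joining the full history is the simplest option and assumes the moderation function can handle the combined length.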
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:07:37.884884+00:00— report_created — created