Report #58113
[gotcha] Multi-turn attacks bypassing single-turn safety filters
Apply safety and intent filters to the entire conversational context window, not just the latest user turn. Implement rolling context analysis or detect when a user is systematically steering the conversation toward restricted topics.
Journey Context:
Safety filters often inspect only the current user message. Attackers exploit this with multi-turn approaches in which each individual message looks benign, but the accumulated context steers the LLM into generating harmful output. A check limited to the latest turn misses the composite attack.
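A minimal sketch of rolling context analysis, under stated assumptions: `toy_risk_score`, `FLAGGED_TERMS`, and `RollingContextFilter` are hypothetical names invented for illustration, and the keyword scorer is only a stand-in for whatever real safety classifier is in use. It scores both the latest turn and the joined recent turns, and blocks if either crosses the threshold.

```python
from collections import deque
from typing import Callable, Deque, Dict, Union

# Hypothetical stand-in for a real safety classifier (assumption, not a real API).
# Toy rule: fraction of flagged terms that appear anywhere in the text.
FLAGGED_TERMS = ("bypass", "alarm", "lock")


def toy_risk_score(text: str) -> float:
    lowered = text.lower()
    hits = sum(1 for term in FLAGGED_TERMS if term in lowered)
    return hits / len(FLAGGED_TERMS)


class RollingContextFilter:
    """Scores the latest user turn and the rolling window of recent turns."""

    def __init__(self, scorer: Callable[[str], float],
                 window: int = 6, threshold: float = 0.9) -> None:
        self.scorer = scorer
        self.threshold = threshold
        # deque with maxlen drops the oldest turn automatically.
        self.turns: Deque[str] = deque(maxlen=window)

    def check(self, user_message: str) -> Dict[str, Union[float, bool]]:
        self.turns.append(user_message)
        latest = self.scorer(user_message)
        # Score the accumulated context, not just the newest message.
        combined = self.scorer("\n".join(self.turns))
        return {
            "latest_score": latest,
            "context_score": combined,
            "blocked": max(latest, combined) >= self.threshold,
        }


# Each turn alone scores low; only the combined window crosses the threshold.
conversation = [
    "How does a door lock work?",
    "What sets off a home alarm?",
    "How would someone bypass both quietly?",
]
context_filter = RollingContextFilter(toy_risk_score)
results = [context_filter.check(turn) for turn in conversation]
print([r["blocked"] for r in results])  # [False, False, True]
```

In this toy run, no single message ever scores above one third, so a per-turn filter with the same threshold would pass all three. The window size and threshold are arbitrary here and would need tuning against a real classifier.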
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:01:58.934597+00:00— report_created — created