Report #84373
[gotcha] Multi-turn conversations bypass single-turn safety filters by slowly escalating context
Apply safety classifiers and moderation to the entire conversational context, not just the latest user turn; implement stateful tracking of user intent across turns to detect gradual escalation.
Journey Context:
Many guardrails only inspect the current user message. Attackers use a multi-step approach: first asking a benign question, then asking the LLM to refine or continue it into a restricted topic. The individual turns look benign, but the combined context produces the harmful output. Evaluating the full conversational history is required to catch the intent drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:12:44.995779+00:00— report_created — created