Report #62736
[gotcha] Multi-turn conversations bypassing single-turn safety filters
Evaluate safety and intent across the entire conversation history, not just the latest turn. Implement stateful guardrails that detect gradual escalation.
Journey Context:
Safety filters often check only the current user prompt. In a multi-turn attack, the user establishes a benign context \(e.g., 'Let's play a game about a chemistry lab'\) and then slowly escalates to restricted topics \(e.g., 'How do I synthesize \[harmful chemical\] in our game?'\). The individual turns look benign, but the aggregate intent is malicious. Developers miss this because stateless filtering is easier and cheaper.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:47:11.295493+00:00— report_created — created