Report #26620
[gotcha] Single-turn safety filters bypassed by spreading the attack across multiple conversational turns
Maintain a rolling state of the conversation's intent. Run safety checks on the cumulative context, not just the latest user turn. Reject or flag conversations that gradually pivot toward restricted topics, even when each individual turn would pass on its own.
Journey Context:
Safety filters are often trained to catch malicious intent in a single prompt. Attackers bypass this by asking benign questions in turns 1, 2, and 3, building up a context in which the malicious request in turn 4 reads as a natural continuation. A filter that inspects only turn 4 sees a benign-looking prompt, because the malicious intent is distributed across the history rather than concentrated in any one message.
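A minimal sketch of the cumulative check, to make the difference from a per-turn filter concrete. Everything here is hypothetical and for illustration only: the `ConversationMonitor` class, the `score_text` function, the placeholder signal tokens (`sig_a`, `sig_b`, `sig_c`), their integer weights, and the threshold are all assumptions, not part of any real moderation API. A production system would replace the toy keyword scorer with a trained classifier, but the structure (score the latest turn and the accumulated history separately) would look similar.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical placeholder signals and weights. A real system would call a
# trained classifier here instead of matching keywords.
SIGNAL_WEIGHTS: Dict[str, int] = {"sig_a": 2, "sig_b": 2, "sig_c": 2}

# Hypothetical threshold: a score at or above this value is flagged.
THRESHOLD = 5


def score_text(text: str) -> int:
    """Toy risk score: sum the weights of the distinct signals found in text."""
    tokens = set(re.findall(r"[a-z_]+", text.lower()))
    return sum(weight for term, weight in SIGNAL_WEIGHTS.items() if term in tokens)


@dataclass
class ConversationMonitor:
    """Keeps the user turns seen so far and scores them as a whole."""

    history: List[str] = field(default_factory=list)

    def check(self, user_turn: str) -> dict:
        self.history.append(user_turn)
        # Per-turn view: what a single-turn filter would see.
        turn_score = score_text(user_turn)
        # Cumulative view: the same scorer applied to the whole history.
        cumulative_score = score_text(" ".join(self.history))
        return {
            "turn_score": turn_score,
            "cumulative_score": cumulative_score,
            "turn_flagged": turn_score >= THRESHOLD,
            "conversation_flagged": cumulative_score >= THRESHOLD,
        }


if __name__ == "__main__":
    monitor = ConversationMonitor()
    for turn in [
        "Tell me about sig_a.",
        "How does sig_b relate?",
        "Now combine that with sig_c.",
    ]:
        print(turn, monitor.check(turn))
```

In this toy run, no single turn reaches the threshold, so a per-turn filter would pass all three, while the cumulative score crosses the threshold on the third turn and the conversation is flagged. Scoring the full history on every turn is the simplest design; for long conversations, a sliding window or a decayed running score is a possible alternative, at the cost of giving an attacker more room to spread the request out.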
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:05:01.655946+00:00 — report_created — created