Report #64564
[gotcha] Multi-step attacks bypassing single-turn safety filters
Apply safety filters and moderation to the entire conversational context, not just the latest user turn. Implement stateful moderation that tracks the intent across turns.
Journey Context:
Input filters often check the current user message for malicious intent. An attacker splits the attack across turns: Turn 1 asks the model to roleplay or establish a context, Turn 2 asks for a benign part of a malicious payload, Turn 3 asks to combine or execute. Each turn looks benign in isolation, but the aggregate context triggers the harmful output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:51:15.039833+00:00— report_created — created