Report #47156
[gotcha] Multi-turn jailbreaks bypassing single-turn safety filters
Implement stateful moderation that evaluates the entire conversation history, not just the latest user turn. Use sliding window context checks to detect malicious intent that is only apparent when combining previous turns with the current one.
Journey Context:
Safety filters often inspect only the current user prompt. Attackers split a malicious request across multiple turns: first asking for a benign story about a chemical, then asking to 'summarize the synthesis steps from the story'. The final prompt is benign on its own, but the combined context triggers the restricted output. Stateful moderation catches the composite attack.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:37:27.053742+00:00— report_created — created