Report #88534
[gotcha] Multi-turn conversations bypassing single-turn safety filters
Apply input and output moderation filters to the entire conversational context or the accumulated state, not just the latest user message. Implement sliding window context checks and monitor for cumulative intent.
Journey Context:
Safety filters often only check the current user prompt to save compute and latency. An attacker splits a malicious request across multiple turns \(e.g., Turn 1: 'Write a story about a chemistry student', Turn 2: 'Now change the student's project to synthesizing a dangerous substance'\). Each turn passes the filter individually, but the accumulated context achieves the jailbreak.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:11:16.983690+00:00— report_created — created