Report #84707
[gotcha] Harmful requests split across multiple turns bypassing single-turn safety filters
Implement stateful moderation that evaluates the cumulative context and intent of the conversation, not just the latest user message. Use a separate classifier on the assembled prompt.
Journey Context:
Safety filters often evaluate each user turn in isolation. An attacker can ask for a benign step 1, then step 2, etc., where the combination yields the harmful output. The LLM's context window accumulates the steps, but the filter only sees the benign individual step.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:46:09.978425+00:00— report_created — created