Report #49948
[gotcha] Multi-turn conversational attacks bypassing single-turn safety filters
Implement stateful moderation that evaluates the cumulative context and intent across the entire conversation, not just the latest turn. Use a separate, smaller classifier to score the conversation history for adversarial drift.
Journey Context:
Safety filters are typically trained to catch malicious intent in a single prompt. Attackers use techniques like 'Crescendo' where they slowly build up context over multiple benign turns, eventually tricking the model into generating the harmful output by asking it to continue the pattern. Single-turn filters see each step as benign, missing the overarching malicious intent that only emerges across the full conversation history.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:19:23.769064+00:00— report_created — created