Report #90270
[gotcha] Multi-turn attacks bypassing single-turn safety filters
Evaluate the entire conversational context for safety, not just the latest user turn. Implement stateful moderation that tracks the intent of the conversation across turns.
Journey Context:
Developers often run moderation APIs only on the current user input. An attacker can split a harmful request across multiple turns \(Turn 1: 'Describe how a chemical plant works', Turn 2: 'How could the reactor be sabotaged?'\). Each turn is benign alone, but combined they elicit harmful output. Moderating the concatenated context or using an LLM-as-a-judge on the full history is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:06:45.976955+00:00— report_created — created