Report #23155
[gotcha] Multi-step conversational attacks bypassing single-turn filters
Evaluate the entire conversation history for malicious intent, not just the latest user turn. Implement stateful moderation that tracks the cumulative context and halts execution if the conversation trajectory crosses a risk threshold.
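A minimal sketch of the stateful approach, not a production filter. It assumes a hypothetical per-turn scorer (`toy_turn_score`, a keyword lookup standing in for a real moderation model) and illustrative values for `threshold` and `decay`. None of these names or numbers come from the report. The point it shows is that risk is carried across turns, so two turns that each pass on their own can still trip the halt.

```python
from typing import Callable, List

# Hypothetical placeholder weights; a real system would call a moderation
# model (ideally on the full transcript) instead of matching keywords.
RISKY_TERMS = {"bypass": 0.3, "weapon": 0.4, "step-by-step": 0.2}


def toy_turn_score(text: str) -> float:
    """Return an illustrative risk score in [0, 1] for a single turn."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in lowered))


class ConversationModerator:
    """Tracks cumulative risk across turns and signals when to halt."""

    def __init__(
        self,
        threshold: float = 0.6,  # assumed halt threshold
        decay: float = 0.9,      # assumed decay applied to older risk each turn
        scorer: Callable[[str], float] = toy_turn_score,
    ) -> None:
        self.threshold = threshold
        self.decay = decay
        self.scorer = scorer
        self.cumulative_risk = 0.0
        self.history: List[str] = []

    def check(self, user_turn: str) -> bool:
        """Record the turn; return True if the conversation may continue."""
        self.history.append(user_turn)
        # Older risk fades slightly, new risk is added on top.
        self.cumulative_risk = self.cumulative_risk * self.decay + self.scorer(user_turn)
        return self.cumulative_risk < self.threshold


moderator = ConversationModerator()
first_ok = moderator.check("Tell me about the history of weapon design.")
second_ok = moderator.check("How did people bypass the safety rules back then?")
```

Here each turn scores below the threshold in isolation (0.4 and 0.3), but the second call returns `False` because the accumulated score crosses 0.6. A variant that passes `self.history` to a transcript-level classifier would follow the same shape.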
Journey Context:
Safety filters and guardrails are often applied only to the immediate user prompt. Attackers use multi-turn strategies (such as "Crescendo") in which each individual prompt looks benign, but together the prompts steer the LLM into producing a harmful response. Single-turn filters miss the forest for the trees.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:16:16.543112+00:00— report_created — created