Report #52061
[gotcha] Single-turn safety classifiers failing to detect multi-turn jailbreaks (Crescendo attack)
Evaluate the full conversational context for malicious intent, not just the latest turn. Implement stateful moderation that tracks the cumulative goal of the conversation.
Journey Context:
Safety filters are typically applied to the latest user message in isolation. The Crescendo attack exploits this by splitting a malicious request into sub-questions spread across multiple turns, each of which looks benign and unrelated to the others. Every turn passes the filter on its own, but the LLM combines the accumulated context to fulfill the harmful request. A stateless single-turn filter never sees that combined request, so it cannot detect the attack.
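The difference between the two approaches can be sketched as follows. This is a minimal illustration, not a production moderator: `toy_risk_score`, `PLACEHOLDER_TERMS`, `StatefulModerator`, and the 0.6 threshold are all hypothetical names and values made up for this example, and the keyword scorer merely stands in for whatever real classifier is in use. The idea it shows is scoring the concatenated conversation history on every turn in addition to the latest message.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical placeholder topics; a real deployment would call an actual
# safety classifier instead of matching keywords.
PLACEHOLDER_TERMS = {"component_a", "component_b", "component_c"}


def toy_risk_score(text: str) -> float:
    """Toy scorer: fraction of placeholder topics that co-occur in the text."""
    tokens = set(re.findall(r"[a-z_]+", text.lower()))
    return len(PLACEHOLDER_TERMS & tokens) / len(PLACEHOLDER_TERMS)


@dataclass
class StatefulModerator:
    """Scores each new turn both in isolation and against the full history."""
    score: Callable[[str], float] = toy_risk_score
    threshold: float = 0.6  # assumed value, chosen only for the demo
    history: List[str] = field(default_factory=list)

    def check_turn(self, user_message: str) -> dict:
        self.history.append(user_message)
        latest = self.score(user_message)
        # Score the whole conversation so far, not just the latest turn.
        cumulative = self.score("\n".join(self.history))
        return {
            "latest_score": latest,
            "cumulative_score": cumulative,
            "blocked_single_turn": latest >= self.threshold,
            "blocked_stateful": max(latest, cumulative) >= self.threshold,
        }


# Each turn mentions only one topic, so it looks harmless in isolation.
turns = [
    "What is component_a used for?",
    "How is component_b usually stored?",
    "Where does component_c fit in?",
]
moderator = StatefulModerator()
results = [moderator.check_turn(t) for t in turns]

for turn, result in zip(turns, results):
    print(turn, result["blocked_single_turn"], result["blocked_stateful"])
```

In this sketch the single-turn check never fires, while the cumulative check trips once enough topics have accumulated in the history. Concatenating raw history is the simplest way to carry state; summarizing the conversation or also including the assistant's replies are alternatives that trade cost against coverage.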
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:52:54.551147+00:00— report_created — created