Report #53399
[gotcha] Multi-turn crescendo attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative intent of a conversation, not just individual turns. Track topic drift across turns and halt conversations that gradually approach restricted domains.
Journey Context:
Safety filters are often stateless: they evaluate each prompt in isolation. The 'Crescendo' attack exploits this by starting with benign questions and gradually escalating over multiple turns. Each turn looks harmless on its own and passes the filter, but together the turns steer the LLM into generating harmful content. Developers often miss this because they test their filters against single-shot red-teaming prompts rather than sustained conversational manipulation.
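A minimal sketch of the stateful approach, under stated assumptions: the names `ConversationMonitor`, `toy_risk_score`, and `RISK_TERMS`, the keyword weights, the decay factor, and both thresholds are all hypothetical and chosen only for illustration. A real deployment would presumably replace the keyword scorer with a trained classifier and tune the thresholds on real conversations. The idea is to keep a decayed running sum of per-turn risk, so that a run of individually acceptable turns can still trip a conversation-level limit.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword weights; a stand-in for a real risk classifier.
RISK_TERMS = {"chemicals": 0.2, "reactive": 0.3, "explosive": 0.4, "detonate": 0.45}


def toy_risk_score(text: str) -> float:
    """Return a per-turn risk in [0, 1]: the highest weight of any known term."""
    words = (w.strip("?.,!") for w in text.lower().split())
    return max((RISK_TERMS.get(w, 0.0) for w in words), default=0.0)


@dataclass
class ConversationMonitor:
    """Tracks cumulative risk across turns (all defaults are assumptions)."""
    scorer: Callable[[str], float] = toy_risk_score
    turn_threshold: float = 0.5        # what a stateless filter would check
    cumulative_threshold: float = 0.8  # conversation-level limit
    decay: float = 0.8                 # how quickly older turns fade
    cumulative: float = 0.0
    history: List[float] = field(default_factory=list)

    def check(self, user_turn: str) -> str:
        risk = self.scorer(user_turn)
        self.history.append(risk)
        # Decayed running sum: recent turns count more than old ones.
        self.cumulative = self.decay * self.cumulative + risk
        if risk >= self.turn_threshold:
            return "block_turn"          # single-turn filter fires
        if self.cumulative >= self.cumulative_threshold:
            return "halt_conversation"   # escalation caught across turns
        return "allow"
```

With these illustrative weights, a four-turn conversation that mentions "chemicals", then "reactive", then "explosive", then "detonate" never exceeds the per-turn threshold of 0.5, yet the fourth turn pushes the decayed sum past 0.8 and `check` returns `"halt_conversation"`. The decay keeps a single borderline question early in a long, otherwise benign conversation from counting against the user indefinitely.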
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:07:39.264451+00:00— report_created — created