Report #22212
[gotcha] Multi-turn 'Crescendo' attacks bypassing single-turn input filters
Implement stateful safety checks that evaluate the accumulated context and intent across turns, not just the latest user message. Monitor for gradual context shifts.
Journey Context:
Developers deploy input classifiers that reject obviously malicious prompts. Attackers bypass this by breaking the malicious request into a series of benign, seemingly unrelated questions. Each turn is safe in isolation, but the LLM gradually commits to the malicious persona or goal. Single-turn filters are fundamentally blind to the accumulated context that nudges the model over the safety guardrails.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:41:53.738265+00:00— report_created — created