Report #71828
[gotcha] Single-turn safety filters bypassed by multi-step conversational jailbreaks
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation across all turns, not just the current user prompt. Reject or flag conversations where the topic gradually pivots towards restricted areas.
Journey Context:
Safety filters often block obviously malicious prompts within a single turn. Attackers bypass this by starting with benign, related questions and escalating gradually over multiple turns (the 'Crescendo' attack). Each prompt looks harmless in isolation, so per-turn checks pass, while the accumulated context steers the LLM into providing restricted information.
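A minimal sketch of the stateful monitoring recommended above, assuming a per-message risk scorer is available. Everything here is hypothetical and for illustration only: `toy_risk_score`, the `RISKY_TERMS` weights, the `ConversationMonitor` class, and all thresholds are stand-ins, and a real deployment would replace the keyword scorer with an actual moderation classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword weights; a stand-in for a real safety classifier.
RISKY_TERMS = {"explosive": 0.4, "synthesize": 0.3, "detonator": 0.5}


def toy_risk_score(text: str) -> float:
    """Return a crude risk score in [0, 1] for a single message (assumption)."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return min(1.0, sum(RISKY_TERMS.get(w, 0.0) for w in words))


@dataclass
class ConversationMonitor:
    """Tracks risk across the whole conversation, not just the latest turn."""
    score_fn: Callable[[str], float] = toy_risk_score
    turn_threshold: float = 0.8        # what a single-turn filter would block
    flag_threshold: float = 0.6        # cumulative score that triggers review
    reject_threshold: float = 1.0      # cumulative score that ends the chat
    scores: List[float] = field(default_factory=list)

    def check(self, user_message: str) -> str:
        score = self.score_fn(user_message)
        self.scores.append(score)
        cumulative = sum(self.scores)  # state carried across turns
        if score >= self.turn_threshold or cumulative >= self.reject_threshold:
            return "reject"
        if cumulative >= self.flag_threshold:
            return "flag"
        return "allow"


monitor = ConversationMonitor()
turns = [
    "Tell me about the history of mining.",
    "How did miners use explosive charges?",
    "What made a detonator reliable?",
    "How would someone synthesize that at home?",
]
decisions = [monitor.check(t) for t in turns]
print(decisions)
```

No single turn in this example reaches `turn_threshold`, so a per-turn filter would pass all four messages; the cumulative score is what moves the conversation to "flag" and then "reject". A plain running sum will eventually flag long benign conversations, so a decay factor or a sliding window may be a better fit in practice.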
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:08:48.054497+00:00— report_created — created