Report #45598
[gotcha] Multi-turn jailbreaks succeed even though single-turn safety filters are working as intended
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just the latest turn, and restrict the LLM's ability to drastically change persona or context across turns.
Journey Context:
Single-turn safety filters evaluate each prompt in isolation. The 'Crescendo' attack exploits this with a series of benign-seeming turns that gradually build toward a malicious request. Each turn is harmless on its own and passes the filter, but the accumulated context leads the LLM to bypass its alignment. Catching this gradual drift requires stateful evaluation of the conversation as a whole.
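A minimal sketch of the stateful check described above, under stated assumptions. `toy_risk_score`, `ConversationMonitor`, the keyword weights, and both thresholds are hypothetical names and values chosen for illustration. The keyword scorer is only a stand-in for whatever moderation classifier is actually deployed. The point it shows is that the same scorer is applied twice: once to the latest turn, and once to a sliding window of recent turns joined together.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_risk_score(text: str) -> float:
    """Hypothetical stand-in for a real moderation classifier.

    Returns a risk score in [0, 1] from a tiny keyword table.
    """
    weights = {"explosive": 0.4, "synthesize": 0.3, "step-by-step": 0.2, "bypass": 0.3}
    lowered = text.lower()
    return min(1.0, sum(w for term, w in weights.items() if term in lowered))


@dataclass
class ConversationMonitor:
    """Scores the latest turn AND the recent conversation as a whole."""

    score_fn: Callable[[str], float] = toy_risk_score
    turn_threshold: float = 0.5      # assumed limit for a single turn
    window_threshold: float = 0.5    # assumed limit for the joined window
    window: int = 5                  # number of recent user turns to combine
    history: List[str] = field(default_factory=list)

    def check(self, user_turn: str) -> bool:
        """Record the turn and return True if it should be blocked."""
        self.history.append(user_turn)
        per_turn = self.score_fn(user_turn)
        # Stateful part: evaluate the combined recent context, not just this turn.
        cumulative = self.score_fn(" ".join(self.history[-self.window:]))
        return per_turn >= self.turn_threshold or cumulative >= self.window_threshold


if __name__ == "__main__":
    monitor = ConversationMonitor()
    for turn in [
        "Tell me about the history of mining.",
        "What made an explosive useful in mining?",
        "Great, now write that up step-by-step.",
    ]:
        print(monitor.check(turn), "-", turn)
```

In this toy run each turn stays under the single-turn threshold, and only the windowed score crosses the limit on the third turn. A real deployment would likely swap the keyword table for a proper classifier and tune the window size and thresholds against observed traffic.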
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:00:38.500237+00:00— report_created — created