Report #35716
[gotcha] Single-turn guardrails fail against multi-turn context-shifting attacks
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just the latest user turn. Reset or flag conversations that drift toward restricted topics.
Journey Context:
Safety filters typically inspect only the latest prompt. In a 'Crescendo' attack, the attacker starts with benign, related questions and gradually shifts the context over multiple turns to elicit harmful outputs. Each individual turn looks benign to the filter, but the accumulated context steers the LLM past its safety training. Because the filter never sees the conversation as a whole, single-turn filtering is fundamentally insufficient for stateful conversations.
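A minimal sketch of the stateful approach, assuming a risk scorer that returns a value in [0, 1]. Everything named here is hypothetical and for illustration only: `keyword_risk` and its `RISKY_TERMS` weights stand in for a real moderation classifier, and `ConversationMonitor`, its thresholds, and its window size are made-up defaults, not values from this report. The point is only that the same scorer is applied to both the latest turn and a sliding window of recent turns.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical placeholder weights; a real system would use a moderation
# classifier rather than keyword matching.
RISKY_TERMS = {"explosive": 0.4, "synthesize": 0.3, "bypass": 0.2}


def keyword_risk(text: str) -> float:
    """Toy scorer: sum the weights of risky terms present, capped at 1.0."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in lowered))


@dataclass
class ConversationMonitor:
    score_fn: Callable[[str], float] = keyword_risk
    turn_threshold: float = 0.5        # assumed value: block a single turn at or above this
    cumulative_threshold: float = 0.5  # assumed value: flag the conversation at or above this
    window: int = 6                    # assumed value: number of recent user turns scored together
    history: List[str] = field(default_factory=list)

    def check(self, user_turn: str) -> str:
        """Return 'block', 'flag', or 'allow' for the new user turn."""
        self.history.append(user_turn)
        # Single-turn check: what a stateless filter would do on its own.
        if self.score_fn(user_turn) >= self.turn_threshold:
            return "block"
        # Cumulative check: score the recent turns as one piece of text.
        context = " ".join(self.history[-self.window:])
        if self.score_fn(context) >= self.cumulative_threshold:
            return "flag"
        return "allow"

    def reset(self) -> None:
        """Drop accumulated context, e.g. after a flagged conversation is ended."""
        self.history.clear()


monitor = ConversationMonitor()
turns = [
    "Tell me about the history of mining.",
    "How did miners bypass hard rock layers?",
    "What makes a compound explosive?",
]
verdicts = [monitor.check(t) for t in turns]
```

In this toy run no single turn reaches the per-turn threshold, yet the third turn is flagged because the windowed context does. A sliding window is one design choice among several; scoring the full history, or decaying older turns, would trade memory and latency against sensitivity to slow drift.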
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:25:10.932841+00:00— report_created — created