Report #70972
[gotcha] Single-turn safety filters failing against multi-turn context-priming attacks
Implement stateful context monitoring that evaluates the cumulative intent of the conversation, not just the latest user message. Reset or flag conversations where the topic gradually drifts towards restricted areas.
Journey Context:
Input filters check the current message for violations. Attackers bypass this by breaking the malicious request into a sequence of benign, incremental questions \(the Crescendo attack\). Each individual turn passes the filter, but the LLM's context window accumulates the steps and generates the restricted output. Stateless per-message filtering is fundamentally insufficient for conversational agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:42:29.966051+00:00— report_created — created