Report #75711
[gotcha] Evaluating only the current user turn for safety while ignoring accumulated multi-turn context
Run safety filters and intent analysis over the entire conversation history, not just the latest message. Reset the context or flag the conversation when it drifts gradually toward restricted topics.
Journey Context:
Single-turn safety filters look for malicious intent in one prompt at a time. Attackers bypass them by splitting a malicious request into a series of benign-looking, incremental turns (the 'Crescendo' attack). Each turn is harmless on its own, but together they build a context that tricks the LLM into generating restricted content. Defending against this requires stateful monitoring of conversation drift.
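A minimal sketch of what stateful monitoring could look like, assuming a per-turn risk scorer is available. Everything named here is hypothetical and for illustration only: `toy_risk_points` is a keyword counter standing in for a real moderation classifier, and `ConversationMonitor`, its thresholds, and the "allow" / "flag" / "block" verdicts are not taken from any specific library. The idea is to keep a running total of risk across the conversation, so that turns which pass individually can still trip a conversation-level limit.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword weights; a real system would call a moderation classifier.
RISK_POINTS = {"synthesize": 3, "precursor": 3, "detonator": 5, "bypass": 2}


def toy_risk_points(text: str) -> int:
    """Return integer risk points for one message (illustrative stand-in only)."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return sum(RISK_POINTS.get(w, 0) for w in words)


@dataclass
class ConversationMonitor:
    """Tracks risk over the whole conversation, not just the latest turn."""
    scorer: Callable[[str], int] = toy_risk_points
    turn_limit: int = 6          # assumed limit for a single message
    conversation_limit: int = 8  # assumed limit for accumulated risk
    scores: List[int] = field(default_factory=list)

    def check(self, user_message: str) -> str:
        """Score the new turn, then judge it against the accumulated history."""
        turn_score = self.scorer(user_message)
        self.scores.append(turn_score)
        if turn_score >= self.turn_limit:
            return "block"   # the turn is risky on its own
        if sum(self.scores) >= self.conversation_limit:
            return "flag"    # each turn passed, but the conversation drifted
        return "allow"

    def reset(self) -> None:
        """Drop accumulated context, e.g. after a flagged conversation is closed."""
        self.scores.clear()


turns = [
    "How do chemists synthesize compounds?",
    "Which precursor is easiest to buy?",
    "How would someone bypass purchase limits?",
]
monitor = ConversationMonitor()
verdicts = [monitor.check(t) for t in turns]
# verdicts == ["allow", "allow", "flag"]: no single turn reaches turn_limit,
# but the running total reaches conversation_limit on the third turn.
```

A plain running sum is the simplest possible drift signal. A production system would more likely score the concatenated history with a classifier, or decay older turns, so that long benign conversations do not accumulate false positives.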
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:40:39.056052+00:00— report_created — created