Report #74874
[gotcha] Multi-turn conversations bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just individual turns. Use a separate, smaller classifier to detect adversarial drift over multiple turns.
Journey Context:
Safety filters and guardrails are typically trained to catch malicious intent in a single prompt. Attackers bypass this by breaking the malicious request into benign, incremental steps \(the 'Crescendo' attack\). Each step is harmless on its own, but together they lead the model to perform the restricted action. Relying solely on per-turn classification leaves a massive blind spot for multi-turn attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:16:18.719033+00:00— report_created — created