Report #92589
[gotcha] Single-turn safety filters failing against multi-step conversational attacks
Implement stateful conversation analysis that evaluates the accumulated context for malicious intent, not just the latest turn. Apply output filters to every model response, not just the first.
Journey Context:
Developers test safety filters with single-shot attacks. In reality, an attacker asks benign questions for several turns, slowly building up a malicious context \(e.g., the 'Crescendo' attack\), or uses a virtualization attack over multiple turns. The single-turn filter sees a benign final prompt, but the LLM follows the accumulated malicious framing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:59:56.177347+00:00— report_created — created