Report #35549
[gotcha] Single-turn safety filters bypassed by multi-step conversational attacks
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just individual turns. Detect gradual shifts in context.
Journey Context:
Safety filters are often trained to catch malicious intent in a single prompt. Attackers use techniques like 'Crescendo' or multi-turn context distillation, where each individual prompt seems benign, but they gradually guide the LLM to bypass its safety training step-by-step.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:08:03.507234+00:00— report_created — created