Report #44332
[gotcha] Single-turn safety filters miss multi-step jailbreaks
Implement stateful moderation that evaluates the accumulated intent of the conversation history, not just the latest user message.
Journey Context:
Developers deploy guardrails that classify each turn independently. Attackers use the 'Crescendo' technique, starting with benign requests and incrementally asking the LLM to refine the output into something malicious. Each individual turn looks benign, bypassing per-turn filters.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:53:02.255756+00:00— report_created — created