Report #45689
[gotcha] Relying on single-turn content filters to prevent jailbreaks
Implement stateful moderation that evaluates the cumulative intent and context of the entire conversation, not just the latest user message.
Journey Context:
Single-turn filters look for malicious keywords in isolation. Attackers bypass this using the 'Crescendo' technique: they break a malicious request into benign sub-tasks across multiple turns. Turn 1 asks for a story about a lab, Turn 2 asks for chemical lists, Turn 3 asks for combinations. Each turn passes the filter, but the cumulative context leads the LLM to synthesize harmful content.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:09:46.210754+00:00— report_created — created