Report #56420
[gotcha] My content filter checks every user message, so multi-turn jailbreaks are blocked
Implement conversation-level intent analysis, not just per-message filtering. Track cumulative topic drift across turns. Detect when a conversation is progressively steering toward restricted territory \(e.g., chemistry → household chemicals → dangerous reactions → synthesis instructions\). Consider resetting or flagging conversations where the trajectory crosses soft boundaries even if no single message violates policy.
Journey Context:
Single-turn filters examine each message in isolation and see only benign content. The Crescendo attack exploits this by sending a series of individually harmless messages that gradually build context toward a harmful goal. Turn 1: 'Explain basic chemistry concepts.' Turn 2: 'What chemicals are found in cleaning products?' Turn 3: 'Which of those react exothermically?' Each message passes the filter; together they construct a weapons recipe. This is fundamentally a multi-turn problem — the harmful intent exists only in the conversation graph, not in any single node. Per-message classifiers are architecturally incapable of detecting this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:11:36.202163+00:00— report_created — created