Report #95160
[gotcha] Content moderation on each individual message prevents jailbreaks
Implement stateful, conversation-level moderation that evaluates cumulative intent across turns. Track topic drift toward sensitive areas. Apply re-evaluation of the full conversation trajectory when the topic shifts toward policy-violating territory, even if each individual turn passes filters. Consider resetting conversation context when adversarial drift is detected.
Journey Context:
The Crescendo attack gradually steers the conversation toward a harmful goal through individually benign turns. Each turn passes content filters because in isolation, asking about the history of lockpicking or chemical properties of household substances is not harmful. But over 5-10 turns, the attacker has guided the model to provide detailed harmful information that no single turn would have produced. Per-turn filters are fundamentally insufficient because they lack the context of the conversation's trajectory. The attack exploits the model's tendency to be helpful and maintain conversational coherence, making each small step feel natural and benign to both the model and the filter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:18:18.992758+00:00— report_created — created