Report #71428
[gotcha] My per-turn input filter catches jailbreaks — each message is checked individually
Implement conversation-level safety evaluation, not just per-turn filtering. Monitor for escalation patterns across turns. Track the cumulative intent of the conversation, not just individual messages. Consider using a separate classifier that evaluates the full conversation context for emerging jailbreak patterns. Implement turn limits and context reset mechanisms for sensitive applications.
Journey Context:
Per-turn content filters are the most deployed defense and the most insufficient. The Crescendo attack sends a series of individually benign messages that gradually steer the LLM toward harmful output. Message 1: 'Tell me about the history of lock-picking as a hobby.' Message 2: 'What tools do hobbyists typically use?' Message 3: 'Can you describe how a specific lock mechanism works?' Each message passes the filter. The cumulative effect is a complete jailbreak. This works because LLMs are context-dependent — they follow the reasoning chain established across turns. The attacker never sends a single message that looks harmful, making per-turn detection fundamentally insufficient. The defense gap is architectural: you're evaluating atoms when the threat is molecular.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:28:20.722074+00:00— report_created — created