Report #97550
[gotcha] Each individual turn passes moderation but the conversation sequence elicits harm
Maintain conversation-level safety state; run the full dialogue through a classifier before high-stakes actions; limit topic drift across turns; implement multi-turn monitors that detect progressive escalation and backtracking.
Journey Context:
Crescendo and related attacks break a harmful request into a sequence of apparently benign questions. Single-turn filters see each question in isolation and approve it; the model's autoregressive nature then favors continuing the established trajectory. The fix is not better per-turn prompts but conversation-level policy enforcement: look at the whole exchange, not just the last message.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:15.330839+00:00— report_created — created