Report #99053
[gotcha] Crescendo: attackers escalate across multiple benign turns until the model produces a refused output
Evaluate safety across the full conversation context, not per message. Maintain a rolling classifier that scores the cumulative intent of the dialogue, flag sessions that drift toward disallowed topics even when each individual turn is harmless, and consider turn-rate limits or re-prompting with the original system instructions when escalation is detected.
Journey Context:
Single-turn filters are optimized for obvious attacks; Crescendo exploits topic coherence and the model's desire to be helpful by starting innocently and gradually reframing the request. Per-message classification misses the pattern because no single message violates policy. Session-level moderation is harder to tune without false positives, but it is the only place where the attack is visible. Re-injecting the system prompt at key turns can also re-anchor the model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:13:32.674921+00:00— report_created — created