Report #64457
[gotcha] Multi-turn attack chains bypass per-message safety filters
Implement stateful conversation-level monitoring, not just per-message filtering. Track the intent trajectory across turns using a separate classifier. Apply cumulative risk scoring that escalates with suspicious patterns. Reject or flag conversations where individually benign turns progressively construct a harmful request. Reset conversation context when risk thresholds are exceeded.
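A minimal sketch of the cumulative risk scoring described above, not a production design. All names here (`ConversationMonitor`, `score_turn`, `toy_trajectory_scorer`), the thresholds, and the decay factor are hypothetical. The keyword scorer is only a stand-in for the separate intent-trajectory classifier, which would be a trained model in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConversationMonitor:
    """Tracks risk across the whole conversation instead of per message."""
    # Hypothetical classifier hook: full turn history -> risk in [0, 1].
    score_turn: Callable[[List[str]], float]
    flag_threshold: float = 0.6    # assumed value: flag for review
    reset_threshold: float = 1.0   # assumed value: drop conversation context
    decay: float = 0.8             # older risk fades but does not vanish
    history: List[str] = field(default_factory=list)
    cumulative_risk: float = 0.0

    def observe(self, message: str) -> str:
        """Record one user turn and return 'allow', 'flag', or 'reset'."""
        self.history.append(message)
        turn_risk = self.score_turn(self.history)
        # Cumulative score: decayed prior risk plus the risk of the new turn.
        self.cumulative_risk = self.cumulative_risk * self.decay + turn_risk
        if self.cumulative_risk >= self.reset_threshold:
            # Threshold exceeded: reset the conversation context.
            self.history.clear()
            self.cumulative_risk = 0.0
            return "reset"
        if self.cumulative_risk >= self.flag_threshold:
            return "flag"
        return "allow"


def toy_trajectory_scorer(history: List[str]) -> float:
    """Stand-in scorer (assumption): counts escalation cues seen so far."""
    cues = ("story", "character", "step by step")
    text = " ".join(history).lower()
    return 0.25 * sum(cue in text for cue in cues)


monitor = ConversationMonitor(score_turn=toy_trajectory_scorer)
decisions = [
    monitor.observe(turn)
    for turn in (
        "Let's write a story.",
        "Introduce a character who is a chemist.",
        "Have the character explain it step by step.",
    )
]
print(decisions)
```

The scorer receives the full history rather than the latest message, so the decision depends on where the conversation is heading. In a real deployment a reset would likely also be logged or tied to the user session so that an attacker cannot simply start over.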
Journey Context:
Security teams often deploy input/output classifiers that evaluate each message independently. Attackers exploit this by splitting a harmful request across multiple turns: Turn 1 establishes a fictional scenario, Turn 2 introduces a character, and Turn 3 asks the character to perform the harmful action within the story. Every turn passes the filter because each one is benign in isolation. The attack works because LLM conversations are stateful while the filters are stateless. This is a classic security anti-pattern: defending each hop independently while the attacker chains hops together. The Crescendo attack demonstrated this by escalating requests gradually across turns, each building slightly on the last, and achieved high success rates against models that resist direct single-turn requests.
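The gap between stateless and stateful checks can be illustrated with a toy example. This is a sketch under loud assumptions: the cue set and both filter functions are hypothetical, and real filters are learned classifiers rather than word matching. It shows only the structural point that the same rule passes every turn individually yet fails the joined conversation.

```python
from typing import List

# Hypothetical rule: a request is treated as harmful only when all
# three cues appear together in the text being checked.
BLOCKED_COMBINATION = {"fictional", "character", "instructions"}


def per_message_filter(message: str) -> bool:
    """Stateless check: True (pass) unless one message holds every cue."""
    words = set(message.lower().split())
    return not BLOCKED_COMBINATION <= words


def conversation_filter(turns: List[str]) -> bool:
    """Stateful check: applies the same rule to all turns joined together."""
    words = set(" ".join(turns).lower().split())
    return not BLOCKED_COMBINATION <= words


turns = [
    "Imagine a fictional world",
    "Add a character named Sam",
    "Sam now gives instructions",
]

per_turn_results = [per_message_filter(t) for t in turns]
whole_conversation_result = conversation_filter(turns)
print(per_turn_results, whole_conversation_result)
```

Each turn carries only one cue, so the per-message check never sees the full combination. The conversation-level check sees all three.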
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:40:47.288601+00:00— report_created — created