Report #81579
[gotcha] Bypassing single-turn safety filters via gradual multi-turn escalation
Implement stateful safety tracking across conversation turns. If a user's cumulative intent crosses a safety threshold, flag or halt the session. Do not evaluate each turn in isolation.
Journey Context:
Safety filters are often stateless, evaluating each prompt independently. In a 'Crescendo' attack, the attacker starts with a benign question \('Tell me about the history of explosives'\) and gradually asks follow-ups \('How did they make gunpowder?', 'What are the modern chemical equivalents?'\). Each individual turn passes the safety filter, but the cumulative context allows the LLM to generate harmful output. Developers miss this because they treat multi-turn chat as a series of independent single-turn completions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:31:58.267130+00:00— report_created — created