Report #24296
[gotcha] Multi-turn jailbreaks bypassing single-turn safety filters
Implement stateful conversation analysis that evaluates the accumulated intent across turns, rather than relying solely on per-turn input/output classifiers.
Journey Context:
Safety filters often inspect only the current user prompt and LLM response. Attackers exploit this with the 'Crescendo' technique: they start with benign requests and escalate gradually, asking the LLM to build on its previous answers. A single-turn filter sees only benign-looking text in each turn, including the final one, while the accumulated context produces the harmful output. Stateful evaluation is computationally heavier, since the accumulated history must be evaluated along with each new turn, but it is necessary to catch context-accumulation attacks.
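A minimal sketch of the idea, not a production filter: each new turn is scored on its own and again together with a sliding window of earlier turns, and the conversation is blocked if either score crosses a threshold. Everything named here is an assumption for illustration: `toy_risk_score` and its keyword weights stand in for a real safety classifier, and `ConversationMonitor`, `threshold`, and `window` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword weights; a real system would call a trained classifier.
RISKY_TERMS = {"precursor": 0.3, "synthesize": 0.3, "bypass": 0.3}


def toy_risk_score(text: str) -> float:
    """Toy stand-in for a safety classifier: returns a risk score in [0, 1]."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return min(1.0, sum(RISKY_TERMS.get(w, 0.0) for w in words))


@dataclass
class ConversationMonitor:
    """Scores each turn alone and together with recent history."""
    scorer: Callable[[str], float] = toy_risk_score
    threshold: float = 0.5   # assumed cutoff, tune per classifier
    window: int = 6          # number of recent user turns scored together
    turns: List[str] = field(default_factory=list)

    def check_turn(self, user_text: str) -> dict:
        self.turns.append(user_text)
        # Single-turn view: what a stateless filter would see.
        per_turn = self.scorer(user_text)
        # Stateful view: the same scorer applied to the accumulated context.
        cumulative = self.scorer(" ".join(self.turns[-self.window:]))
        return {
            "per_turn": per_turn,
            "cumulative": cumulative,
            "blocked": max(per_turn, cumulative) >= self.threshold,
        }


if __name__ == "__main__":
    monitor = ConversationMonitor()
    for turn in [
        "Tell me about the history of chemistry.",
        "Which precursor did they use?",
        "How would someone synthesize it today?",
    ]:
        print(turn, monitor.check_turn(turn))
```

In this toy run no individual turn reaches the threshold, but the third turn is blocked because the windowed score does. The window bounds the extra cost per turn; a larger window catches slower escalation at the price of more classifier input.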
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:11:24.303633+00:00— report_created — created