Report #91102
[gotcha] Why do single-turn safety filters fail to stop multi-step jailbreak conversations?
Implement stateful conversation monitoring that tracks the cumulative intent of the conversation, not just the current turn. Use a separate, lightweight classifier to evaluate the entire conversation history for adversarial drift, and reset the context or halt if the conversation gradually veers into restricted territory.
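A minimal sketch of the stateful monitor described above, assuming a pluggable scorer that receives the full list of user turns and returns a risk score between 0 and 1. The names `ConversationMonitor`, `Action`, `keyword_drift_scorer`, and both threshold values are hypothetical and chosen only for illustration; a real deployment would swap the toy keyword scorer for a trained lightweight classifier.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Action(Enum):
    ALLOW = "allow"
    RESET = "reset"  # drop the accumulated context and start fresh
    HALT = "halt"    # stop the conversation entirely


# Hypothetical classifier interface: takes every user turn so far and
# returns a risk score in [0.0, 1.0] for the conversation as a whole.
HistoryScorer = Callable[[List[str]], float]


@dataclass
class ConversationMonitor:
    score_history: HistoryScorer
    reset_threshold: float = 0.5  # assumed value, tune on real data
    halt_threshold: float = 0.8   # assumed value, tune on real data
    turns: List[str] = field(default_factory=list)

    def observe(self, user_turn: str) -> Action:
        """Record a turn, then judge the whole history, not just this turn."""
        self.turns.append(user_turn)
        risk = self.score_history(self.turns)
        if risk >= self.halt_threshold:
            return Action.HALT
        if risk >= self.reset_threshold:
            self.turns.clear()
            return Action.RESET
        return Action.ALLOW


def keyword_drift_scorer(turns: List[str]) -> float:
    """Toy stand-in for a real classifier: share of flagged terms seen so far."""
    flagged = {"bypass", "disable", "undetected"}  # placeholder terms
    seen = {w for t in turns for w in t.lower().split() if w in flagged}
    return len(seen) / len(flagged)
```

Calling `monitor.observe(turn)` before each model call lets the application decide whether to forward the turn, clear the context, or end the session. Scoring the history separately from the main model keeps the check cheap and independent of the context the attacker is trying to shape.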
Journey Context:
Safety filters often check each prompt in isolation for malicious intent. Attackers bypass this with the 'Crescendo' technique: they start with benign questions and slowly build context over multiple turns, asking the LLM to refine or adjust its previous answers until it outputs restricted content. The LLM's context window retains the established (manipulated) context, so the final harmful request seems logically consistent to the model even though the final prompt, taken alone, looks benign.
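The gap can be illustrated with a toy calculation. The sketch below assumes each turn already has a per-turn risk score from some filter, and compares a per-turn check against a decayed running total. The function names, the decay factor, the thresholds, and the example scores are all made up for illustration; a decayed sum is only one possible way to aggregate and is not taken from the report.

```python
from typing import List, Optional


def first_turn_blocked_per_turn(scores: List[float], threshold: float) -> Optional[int]:
    """Index of the first turn whose own score reaches the threshold, else None."""
    for i, s in enumerate(scores):
        if s >= threshold:
            return i
    return None


def first_turn_blocked_cumulative(
    scores: List[float], threshold: float, decay: float = 0.5
) -> Optional[int]:
    """Index of the first turn where the decayed running total reaches the threshold."""
    running = 0.0
    for i, s in enumerate(scores):
        # Older turns fade but still count, so a slow escalation adds up.
        running = running * decay + s
        if running >= threshold:
            return i
    return None


# Illustrative scores for a conversation that escalates a little each turn.
escalating = [0.1, 0.2, 0.3, 0.4]
print(first_turn_blocked_per_turn(escalating, threshold=0.5))    # None
print(first_turn_blocked_cumulative(escalating, threshold=0.5))  # 3
```

No single turn reaches the threshold, so the per-turn check never fires, while the running total crosses it on the fourth turn. The decay keeps a long conversation with uniformly low scores from tripping the check, but both the decay and the threshold would need calibration against real traffic.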
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:30:32.967459+00:00— report_created — created