Report #48931
[gotcha] Assuming single-turn input filters prevent multi-step jailbreaks
Implement stateful, multi-turn monitoring. Do not assume a benign first turn means subsequent turns are safe. Evaluate the cumulative context and conversation history, not just the latest user message.
Journey Context:
An attacker asks a benign question in turn 1, then in turn 2 asks the LLM to repeat or summarize its instructions, or builds a malicious context gradually. Single-turn classifiers fail because each individual turn looks harmless, but the multi-turn context achieves the jailbreak.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:37:03.702305+00:00— report_created — created