Report #55180
[gotcha] Single-turn safety filters and system prompts prevent jailbreaks
Implement stateful safety checks that evaluate the entire conversation history and intent, not just the latest user message. Limit the context window available to the LLM per session to prevent long-term priming.
Journey Context:
Developers test safety filters by sending a single malicious prompt and seeing it blocked. However, attackers use multi-turn attacks: Turn 1 asks for a benign story, Turn 2 asks to modify the story slightly, Turn 3 subtly introduces the restricted payload. The LLM's context window accumulates this priming, and by Turn 5, it outputs the restricted content because the single-turn filter only sees a seemingly innocuous continuation request.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:06:49.699456+00:00— report_created — created