Report #76048
[gotcha] Single-turn safety filters failing against multi-turn jailbreaks
Implement stateful conversation monitoring that evaluates the cumulative intent of the conversation, not just the current turn, and reset context or block the user when gradual manipulation is detected.
Journey Context:
Developers deploy guardrails that evaluate each prompt in isolation. Attackers use multi-turn attacks where each individual prompt is benign, but they gradually steer the LLM into generating harmful content by asking it to build on previous, safe responses. Single-turn filters are fundamentally blind to this cumulative drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:14:42.047063+00:00— report_created — created