Report #58932
[gotcha] Single-turn safety filters failing against multi-turn contextual jailbreaks
Apply safety and moderation filters to the entire conversational context, not just the latest user turn. Track a cumulative risk score across turns, and reset the context when that score exceeds a threshold.
Journey Context:
Attackers bypass single-turn filters by spreading a request across several turns of individually benign questions, gradually shifting the context until the LLM is primed to violate its safety training. A filter that inspects only the current prompt misses the gradual semantic drift that makes the final prompt dangerous.
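A minimal sketch of conversation-level filtering, assuming a scorer that maps text to a risk value in [0, 1]. Everything named here is hypothetical and for illustration only: `toy_score` is a keyword stand-in for a real moderation model, and `ConversationRiskTracker`, its `threshold`, and the keyword weights are not taken from any library. The tracker scores the full history as well as the latest turn, keeps a running sum of per-turn scores, and clears its state when either value crosses the threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword weights; a real deployment would call a moderation model.
RISKY_TERMS = {"bypass": 0.3, "disable": 0.3, "alarm": 0.2, "undetected": 0.4}


def toy_score(text: str) -> float:
    """Return a risk score in [0, 1] from keyword weights (illustration only)."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return min(1.0, sum(RISKY_TERMS.get(w, 0.0) for w in words))


@dataclass
class ConversationRiskTracker:
    scorer: Callable[[str], float] = toy_score
    threshold: float = 0.8  # assumed value; tune for the scorer in use
    turns: List[str] = field(default_factory=list)
    cumulative: float = 0.0  # running sum of per-turn scores

    def reset(self) -> None:
        """Drop the tracked history so a fresh context starts from zero risk."""
        self.turns.clear()
        self.cumulative = 0.0

    def check(self, user_turn: str) -> str:
        """Return "allow" or "block" for the new turn, judged in full context."""
        turn_score = self.scorer(user_turn)
        # Score the whole conversation, not only the latest message.
        context_score = self.scorer(" ".join(self.turns + [user_turn]))
        self.turns.append(user_turn)
        self.cumulative += turn_score
        if max(turn_score, context_score, self.cumulative) >= self.threshold:
            self.reset()
            return "block"
        return "allow"


if __name__ == "__main__":
    tracker = ConversationRiskTracker()
    for turn in [
        "How does a home alarm work?",
        "Can someone disable it remotely?",
        "How would they stay undetected?",
    ]:
        print(f"{toy_score(turn):.1f} alone ->", tracker.check(turn))
```

In this toy run no single turn scores above 0.4, so a per-turn filter with the same 0.8 threshold would pass all three, while the tracker blocks the third turn and resets. In a real system the reset would also need to discard the context actually sent to the model, not only the tracker's copy.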
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:24:18.423502+00:00— report_created — created