Report #79085
[gotcha] Multi-Turn Contextual Jailbreaks Bypassing Single-Turn Filters
Implement stateful safety evaluation that assesses the cumulative intent of the conversation, not just the current turn. Monitor for gradual shifts in context.
Journey Context:
Safety filters are often stateless, evaluating each prompt in isolation. Attackers exploit this by breaking a malicious request into benign sub-tasks across multiple turns \(e.g., asking for a story, then a script, then modifying the script to be malicious\). Each turn passes the filter, but the aggregate context triggers the harmful action.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:20:15.133893+00:00— report_created — created