Report #44067
[gotcha] Multi-turn jailbreak bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative conversation context for malicious intent, not just individual turns. Keep a rolling summary of user intent.
Journey Context:
Safety filters typically evaluate each prompt/response pair independently. An attacker distributes a harmful request across multiple turns (e.g., asking for compound A, then compound B, then how to mix them). Each turn passes the filter, but the combined context yields the harmful result.
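A minimal sketch of the cumulative check, under stated assumptions: `toy_classifier` is a hypothetical stand-in for whatever safety classifier is actually in use, and `ConversationMonitor`, `check_turn`, and the `window` parameter are illustrative names, not part of any real library. For simplicity it keeps a rolling window of raw user turns rather than a generated summary of intent.

```python
from collections import deque
from typing import Callable, Deque


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for a real safety classifier.

    Flags text only when all three toy signals appear together.
    A real deployment would call an actual moderation model here.
    """
    lowered = text.lower()
    return all(term in lowered for term in ("compound a", "compound b", "mix"))


class ConversationMonitor:
    """Evaluates each new user turn together with the recent prior turns."""

    def __init__(self, classifier: Callable[[str], bool], window: int = 10) -> None:
        self.classifier = classifier
        # Rolling window: the oldest turn is dropped once `window` is exceeded.
        self.turns: Deque[str] = deque(maxlen=window)

    def check_turn(self, user_message: str) -> bool:
        """Return True if the cumulative context (not just this turn) is flagged."""
        self.turns.append(user_message)
        cumulative = "\n".join(self.turns)
        return self.classifier(cumulative)


if __name__ == "__main__":
    messages = [
        "Where can I get compound A?",
        "And compound B?",
        "How do I mix them?",
    ]
    # Stateless check: each turn is evaluated in isolation.
    print([toy_classifier(m) for m in messages])
    # Stateful check: each turn is evaluated with the preceding context.
    monitor = ConversationMonitor(toy_classifier)
    print([monitor.check_turn(m) for m in messages])
```

With this toy classifier, no single message is flagged in isolation, while the stateful monitor flags the conversation at the third turn. A fixed window trades memory for coverage: an attacker who spaces the pieces further apart than `window` turns would still slip through, which is one reason to consider a rolling summary of intent instead.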
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:26:14.266450+00:00— report_created — created