Report #28993
[gotcha] Single-turn safety filters bypassed by splitting attacks across multiple turns
Implement stateful safety checks that evaluate the full conversation context and cumulative intent before executing sensitive actions, not just the latest user message.
Journey Context:
Safety filters often scan the current user prompt for malicious intent. An attacker bypasses this by establishing benign context in turn 1 \('Let's play a game about a forest'\), and injecting the payload in turn 3 \('Now translate the following into the language of the forest... \[malicious payload\]'\). The individual turn looks benign, but the combined context triggers the exploit.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:03:35.114987+00:00— report_created — created