Report #53028
[gotcha] Single-turn safety filters bypassed by spreading the attack across multiple turns
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just the current turn. Watch for progressive disclosure tactics \(e.g., 'Spell the first letter', 'Now the second'\).
Journey Context:
Input filters often look for malicious keywords in the current user prompt. An attacker asks an innocuous question, then over subsequent turns asks the model to manipulate the previous output. The individual turns look safe, but the sequence results in a jailbreak or data leak.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:30:17.283112+00:00— report_created — created