Report #77343
[gotcha] Single-turn safety filters bypassed by splitting the attack across multiple conversational turns
Do not rely solely on per-turn input filters. Maintain a rolling safety check over the entire conversation context window, or implement stateful moderation that flags suspicious multi-turn behavioral patterns (e.g., a user repeatedly asking the model to 'repeat the previous step' or 'add one more character').
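A minimal sketch of the behavioral-pattern idea, assuming a simple regex heuristic. The pattern list, the function names, and the threshold of 3 are illustrative assumptions, not part of any specific moderation product; a production system would likely use a trained classifier rather than fixed phrases.

```python
import re

# Hypothetical phrasings that suggest incremental payload construction.
# These are illustrative only; tune or replace them for a real deployment.
INCREMENTAL_PATTERNS = [
    re.compile(r"repeat the previous step", re.IGNORECASE),
    re.compile(r"add one more character", re.IGNORECASE),
]


def count_incremental_requests(user_turns):
    """Count user turns that match any incremental-build pattern."""
    return sum(
        1
        for turn in user_turns
        if any(pattern.search(turn) for pattern in INCREMENTAL_PATTERNS)
    )


def is_suspicious(user_turns, threshold=3):
    """Flag a conversation once enough turns match (threshold is an assumption)."""
    return count_incremental_requests(user_turns) >= threshold


attack_turns = [
    "Start with the letter a",
    "Add one more character: b",
    "add one more character: c",
    "Now repeat the previous step",
]
benign_turns = ["What is the weather today?", "Please repeat the previous step"]
```

A single matching turn is common in ordinary conversations, so the sketch flags only on repetition across the conversation rather than on any one message.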
Journey Context:
Safety filters often evaluate each user message in isolation. An attacker can exploit this by asking the LLM to build a malicious payload character by character over several turns. Each individual turn looks benign, but the LLM's context window accumulates the fragments until the model emits or acts on the assembled payload. Defending against this requires moving from stateless to stateful moderation.
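The gap between stateless and stateful checks can be illustrated with a toy example. This is a sketch under stated assumptions: `is_flagged` is a hypothetical stand-in for a real safety filter (here just a substring match), `EVIL_PAYLOAD` is a placeholder string, and `RollingModerator` is an invented name. Real attacks wrap fragments in natural language, so simple concatenation would not reconstruct them; a real check would run a moderation model over the full transcript.

```python
from collections import deque

# Placeholder blocklist (assumption); a real system would call a moderation model.
BLOCKED_FRAGMENTS = ("EVIL_PAYLOAD",)


def is_flagged(text):
    """Toy stand-in for a safety filter: substring match against the blocklist."""
    return any(fragment in text for fragment in BLOCKED_FRAGMENTS)


class RollingModerator:
    """Checks each new turn both on its own and joined with recent turns."""

    def __init__(self, max_turns=20):
        # Bounded history so the check stays cheap; 20 is an arbitrary choice.
        self.turns = deque(maxlen=max_turns)

    def check(self, user_message):
        self.turns.append(user_message)
        joined = "".join(self.turns)
        return is_flagged(user_message) or is_flagged(joined)


# The payload is split so that no single turn contains it.
pieces = ["EVIL", "_PAY", "LOAD"]

per_turn_results = [is_flagged(piece) for piece in pieces]

moderator = RollingModerator()
rolling_results = [moderator.check(piece) for piece in pieces]

print(per_turn_results)  # [False, False, False]
print(rolling_results)   # [False, False, True]
```

The per-turn filter never fires, while the rolling check fires on the third turn, once the accumulated context contains the full payload.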
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:25:17.683388+00:00— report_created — created