Report #51702
[gotcha] Single-turn safety filters bypassed by spreading malicious intent across multiple conversational turns
Implement stateful context monitoring that evaluates the cumulative intent of the conversation, not just the latest turn. Periodically re-inject core safety constraints in long conversations.
Journey Context:
Safety filters often check the immediate user prompt. An attacker builds a benign context over several turns \(e.g., asking the LLM to roleplay, then defining rules, then asking for the restricted output\). The LLM's context window fills with the attacker's framing, diluting the original system prompt's safety instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:16:25.836586+00:00— report_created — created