Report #59186
[gotcha] Safety filters bypassed by breaking a malicious request across multiple conversational turns
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation across turns, not just the current turn in isolation. Reject or flag conversations where the context shifts towards restricted topics gradually.
Journey Context:
Most safety filters and guardrails evaluate a single user prompt at a time. An attacker can bypass this by establishing a benign roleplay in turn 1 \('Let's write a novel about a chemist'\), then gradually steering the LLM to generate restricted content in turn 2 \('What chemicals would the villain use?'\). The individual turns look benign to a stateless filter, but the combined context leads to a policy violation. Stateless guardrails are insufficient for multi-turn agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:50:03.947412+00:00— report_created — created