Report #47887
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative intent across the entire conversation, not just the latest turn. Reject or flag conversations where the context gradually shifts towards restricted topics.
Journey Context:
Safety filters are typically stateless: they evaluate each user prompt in isolation. An attacker exploits this with a multi-turn approach. Turn 1 asks for a benign story, Turn 2 asks to modify the setting, and Turn 3 introduces restricted elements. Each prompt passes the filter on its own, but the accumulated context produces the restricted output. Because the filter evaluates only the delta (the newest message), the attacker can poison the context window gradually without any single turn crossing the threshold.
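The gap between per-turn and conversation-level evaluation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword weights, both thresholds, and the names `turn_risk`, `stateless_allows`, and `ConversationMonitor` are hypothetical, and the keyword scorer stands in for whatever classifier a real deployment would use.

```python
import re
from dataclasses import dataclass, field

# Hypothetical risk weights; a real system would use a trained classifier.
RISK_WEIGHTS = {"lab": 0.3, "chemical": 0.4, "synthesis": 0.5}
TURN_THRESHOLD = 0.6          # assumed per-turn block threshold
CONVERSATION_THRESHOLD = 1.0  # assumed cumulative flag threshold


def turn_risk(text: str) -> float:
    """Toy scorer: sum the weights of the distinct risky terms found in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for term, weight in RISK_WEIGHTS.items() if term in words)


def stateless_allows(text: str) -> bool:
    """Stateless filter: looks only at the latest turn."""
    return turn_risk(text) < TURN_THRESHOLD


@dataclass
class ConversationMonitor:
    """Stateful filter: scores the whole conversation on every turn."""
    history: list[str] = field(default_factory=list)

    def check(self, text: str) -> bool:
        # The turn stays in history even when flagged, so a flagged
        # conversation remains flagged on later turns.
        self.history.append(text)
        return turn_risk(" ".join(self.history)) < CONVERSATION_THRESHOLD


turns = [
    "Write a short story about a chemistry teacher",
    "Move the story into a lab",
    "The teacher picks up a chemical",
    "Describe the synthesis step by step",
]

# Every turn passes the stateless filter in isolation.
print([stateless_allows(t) for t in turns])  # [True, True, True, True]

# The stateful monitor flags the conversation on the fourth turn.
monitor = ConversationMonitor()
print([monitor.check(t) for t in turns])     # [True, True, True, False]
```

In this sketch the cumulative score counts each risky term once across the whole history, so the conversation is flagged when the combined context crosses the threshold even though no single turn does. A production monitor would likely also need to handle history truncation and decide whether a flag rejects the turn or escalates it for review.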
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:51:51.986698+00:00— report_created — created