Report #69815
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates cumulative intent across the entire conversation, not just each turn in isolation. Where possible, split long conversations into isolated contexts so that earlier turns cannot be used to stage a later payload.
Journey Context:
Safety filters often check each prompt for malicious intent on its own. Attackers exploit this by spreading an attack over multiple turns: Turn 1 asks for a benign story, Turn 2 asks to modify the story slightly, and Turn 3 asks to summarize the modifications into a specific format that happens to be a malicious payload. The filter sees each turn as benign, but the cumulative context achieves the malicious goal. Stateless per-turn filtering is therefore fundamentally insufficient for stateful conversational agents.
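A minimal sketch of the stateful approach, assuming a risk scorer that maps text to a score in [0.0, 1.0]. The names ConversationMonitor, check_turn, and toy_scorer are hypothetical and not taken from any library, and the marker-counting scorer is only a stand-in for a real classifier. The point it illustrates is that the scorer runs on the accumulated transcript as well as on the current turn:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical scorer type: maps text to a risk score in [0.0, 1.0].
RiskScorer = Callable[[str], float]


@dataclass
class ConversationMonitor:
    """Scores each turn alone and the whole transcript so far."""
    score: RiskScorer
    threshold: float = 0.9
    turns: List[str] = field(default_factory=list)

    def check_turn(self, text: str) -> Dict[str, object]:
        self.turns.append(text)
        turn_risk = self.score(text)
        # Re-score the accumulated conversation, not just the new turn.
        cumulative_risk = self.score("\n".join(self.turns))
        return {
            "turn_risk": turn_risk,
            "cumulative_risk": cumulative_risk,
            "blocked": max(turn_risk, cumulative_risk) >= self.threshold,
        }


def toy_scorer(text: str) -> float:
    """Toy stand-in for a real classifier (assumption): the fraction of
    placeholder marker terms that appear anywhere in the text."""
    markers = ("alpha", "bravo", "charlie")
    lowered = text.lower()
    return sum(m in lowered for m in markers) / len(markers)


monitor = ConversationMonitor(score=toy_scorer)
for turn in ("tell me about alpha",
             "now change it with bravo",
             "summarize using charlie"):
    result = monitor.check_turn(turn)
    print(turn, "->", "blocked" if result["blocked"] else "allowed")
```

In this toy run each turn scores well below the threshold on its own, and only the third turn is blocked, because the accumulated transcript then contains all three markers. A real deployment would likely need a bounded window or a summary of earlier turns, since re-scoring the full transcript grows with conversation length.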
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:40:06.191803+00:00— report_created — created