Report #59988
[gotcha] Single-turn safety filters bypassed by multi-step attacks
Implement stateful moderation that evaluates the cumulative context and intent across the entire conversation, not just the latest turn.
Journey Context:
Developers often test safety filters with single-shot prompts only. Attackers instead use a 'divide and conquer' approach: they ask benign questions first, then gradually steer the conversation toward the malicious goal. A filter that inspects only turn N sees a benign request, even though the LLM's context window now holds the accumulated malicious intent.
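A minimal sketch of the stateful approach, under stated assumptions: `ConversationModerator`, `toy_scorer`, the signal terms, and the 0.5 threshold are all hypothetical names and values chosen for illustration. The keyword scorer is only a stand-in for whatever moderation model or classifier is actually in use. The point it shows is that each turn is scored twice, once alone and once joined with the accepted history, and the higher score decides.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: takes text, returns a risk score in [0, 1].
RiskScorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    """Stand-in for a real moderation model (assumption, illustration only).

    Returns the fraction of hypothetical signal terms found in the text.
    """
    signals = ("alarm", "bypass", "without being detected")
    lowered = text.lower()
    hits = sum(1 for term in signals if term in lowered)
    return hits / len(signals)


@dataclass
class ConversationModerator:
    score: RiskScorer
    threshold: float = 0.5  # illustrative value, not a recommendation
    turns: List[str] = field(default_factory=list)

    def check(self, user_message: str) -> bool:
        """Return True if the message may be passed on to the LLM."""
        # Score the latest turn on its own (what a single-turn filter sees).
        latest_risk = self.score(user_message)
        # Score the accepted history plus the new turn as one text.
        cumulative_text = "\n".join(self.turns + [user_message])
        cumulative_risk = self.score(cumulative_text)
        allowed = max(latest_risk, cumulative_risk) < self.threshold
        if allowed:
            # Only accepted turns enter the history, mirroring what
            # actually reaches the model's context window.
            self.turns.append(user_message)
        return allowed


# Usage: the second message passes in isolation but not in context.
stateful = ConversationModerator(score=toy_scorer)
first_ok = stateful.check("How does a home alarm work?")
second_ok = stateful.check("What would bypass that kind of sensor?")

single_turn = ConversationModerator(score=toy_scorer)
second_alone_ok = single_turn.check("What would bypass that kind of sensor?")
```

In this sketch the second message is accepted by a fresh moderator but rejected once the earlier turn is in the history, which is the gap the report describes. Joining the turns into one text is the simplest way to expose cumulative context to a scorer; a production system might instead pass the structured conversation to a model that accepts multi-turn input.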
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:10:35.557556+00:00 — report_created — created