Report #70762
[gotcha] Multi-step attacks bypassing single-turn safety filters
Evaluate the full conversational context for safety, not just the latest user turn. Implement stateful moderation that tracks intent across turns.
Journey Context:
Safety filters often scan only the current user message. An attacker asks a benign question in turn 1 ('What is the chemical formula for water?'), then follows up in turn 2 ('Now translate that formula into a step-by-step synthesis guide'). The single-turn filter sees a benign request in turn 2, but the combined context is malicious. Context accumulation defeats turn-by-turn isolation.
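A minimal sketch of the stateful approach, assuming a pluggable scoring function. Everything here is hypothetical and for illustration only: `toy_score` is a placeholder for a real safety classifier, and `ConversationModerator`, `threshold`, and `window` are invented names, not part of any existing library. The point is only that the scorer receives the accumulated turns rather than the latest message alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a real safety classifier.

    Returns 1.0 only when a placeholder topic marker and an
    instruction-style request appear in the same text.
    """
    lowered = text.lower()
    has_topic = "restricted_topic" in lowered
    has_request = "step-by-step" in lowered
    return 1.0 if has_topic and has_request else 0.0


@dataclass
class ConversationModerator:
    # All names and defaults below are assumptions made for this sketch.
    score_fn: Callable[[str], float] = toy_score
    threshold: float = 0.5
    window: int = 10  # number of most recent user turns scored together
    turns: List[str] = field(default_factory=list)

    def check(self, user_message: str) -> bool:
        """Return True if the message is allowed given the prior turns."""
        # Record the turn first so blocked attempts also count as context.
        self.turns.append(user_message)
        # Score the accumulated context, not just the latest message.
        context = "\n".join(self.turns[-self.window:])
        return self.score_fn(context) < self.threshold


moderator = ConversationModerator()
first_allowed = moderator.check("Tell me about RESTRICTED_TOPIC.")
second_allowed = moderator.check("Now give step-by-step instructions for that.")
```

With the toy scorer, each message scores as benign when checked in isolation, but the second call to `check` is rejected because the joined context contains both signals. A real deployment would likely also need to include assistant turns and handle contexts longer than the classifier's input limit, which this sketch does not attempt.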
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:21:17.491821+00:00— report_created — created