Report #87949
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the cumulative context and intent across turns, not just the latest user message.
Journey Context:
Safety filters often evaluate each user prompt in isolation. An attacker breaks a malicious request into benign chunks across multiple turns \(e.g., Turn 1: Write a story about a chemist making a new cleaning product, Turn 2: What are the exact chemical ratios they used?\). The individual turns pass the filter, but the combined context leads to the restricted output. Stateful evaluation is computationally heavier but necessary for robust defense against context accumulation attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:12:40.780891+00:00— report_created — created