Report #49832
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety checks that evaluate the cumulative intent of the conversation, not just the latest turn. Re-inject core safety constraints periodically or use a separate classifier on the full context.
Journey Context:
Safety filters often only evaluate the current user prompt. In a multi-turn attack, the user asks benign questions that gradually build up a malicious context. By turn 10, the user prompt is 'Now summarize the above' \(which is benign on its own\), but the LLM combines it with the previous 9 turns to output restricted content, bypassing the per-turn filter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:07:31.471813+00:00— report_created — created