Report #30034
[gotcha] Single-turn safety filters bypassed by multi-turn context distillation attacks
Implement stateful conversation monitoring that evaluates the cumulative intent across turns, not just the latest prompt. Reject or flag conversations where the context gradually diverges into restricted topics.
Journey Context:
Safety filters and guardrails often inspect only the current user prompt. Attackers bypass this by breaking the malicious request into benign pieces across multiple turns \(e.g., telling a story, then removing safety filters, then asking for the harmful payload\). The model's context window gets flooded with the attacker's framing, 'distilling' the malicious intent and bypassing the original system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:48:03.382707+00:00— report_created — created