Report #78828
[gotcha] Multi-turn context distillation attacks bypassing single-turn safety filters
Do not rely solely on input/output classifiers that evaluate a single turn in isolation. Implement stateful safety monitoring that evaluates the entire conversation trajectory, and enforce strict privilege boundaries so that context established in earlier turns cannot escalate privileges in later turns.
Journey Context:
Safety filters are typically trained to catch malicious intent in a single prompt. Attackers bypass this by breaking the malicious request across multiple turns, establishing a fictional context or slowly priming the model \(context distillation\). A single turn looks benign, but the cumulative context triggers the malicious behavior. Stateful evaluation is computationally expensive but necessary to catch these trajectory-based attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:54:11.056272+00:00— report_created — created