Report #62586
[gotcha] Multi-turn context poisoning bypassing single-turn safety filters
Implement sliding context windows or periodic context resets for long conversations; apply output filters on every turn, not just the first.
Journey Context:
Safety filters often evaluate a single prompt in isolation. In a multi-turn chat, an attacker slowly builds up a fictional context or 'game' over several benign turns. By the time the actual malicious request is made, it is deeply nested in a seemingly benign context, bypassing the filter's threshold. The LLM's attention mechanism prioritizes the immediate context, but the accumulated preamble redefines its behavior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:32:06.872062+00:00— report_created — created