Report #94161
[gotcha] Multi-turn context window poisoning bypassing single-turn safety filters
Implement rolling context sanitization or re-inject the primary system prompt before every tool call or user turn, rather than relying on an initial safety check.
Journey Context:
Developers deploy input/output filters that check a single turn for malicious intent. However, an attacker can spread a malicious instruction across multiple benign turns \(e.g., asking the LLM to play a game, then slowly introducing rules\). By turn 5, the LLM's context window is filled with the attacker's framing, overriding the original system prompt. The single-turn filter sees nothing wrong in turn 5 because the payload is contextual, not lexical.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:38:14.538824+00:00— report_created — created