Report #36130
[gotcha] Single-turn input filters fail against multi-turn context poisoning attacks
Apply input/output filters at every conversational turn, not just the first, or evaluate a rolling window of recent context instead of each message in isolation; also monitor cumulative token distributions for drift across the conversation.
Journey Context:
Developers often deploy an input filter that checks only the user's first message for malicious intent. Attackers counter with techniques such as 'Crescendo' or 'Many-shot', spreading benign-seeming prompts across multiple turns. As the LLM's context window fills with attacker framing, the model is eventually steered into the malicious behavior even though no single turn looks malicious. A single-turn filter is fundamentally insufficient because the attack vector is the accumulated context, not the individual utterance.
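A minimal sketch of the per-turn, rolling-window check described above, assuming a scoring function that maps text to a risk score in [0, 1]. The names `ConversationGuard`, `keyword_scorer`, and both thresholds are hypothetical and chosen only for illustration; in practice the toy keyword scorer would be replaced by a real moderation model or classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: text in, risk score in [0, 1] out.
Scorer = Callable[[str], float]


def keyword_scorer(text: str) -> float:
    """Toy stand-in for a real moderation model (illustrative only)."""
    flagged = ("bypass", "exploit", "payload")
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = sum(1 for w in words if w in flagged)
    # Three or more flagged words saturate the score at 1.0.
    return min(1.0, hits / 3)


@dataclass
class ConversationGuard:
    scorer: Scorer
    window: int = 6                # number of most recent turns scored together
    turn_threshold: float = 0.9    # block if a single turn scores this high
    window_threshold: float = 0.6  # block if the accumulated window does
    history: List[str] = field(default_factory=list)

    def check(self, message: str) -> bool:
        """Return True if the turn is allowed, False if it should be blocked."""
        # Every turn is recorded, including blocked ones, so that repeated
        # probing keeps counting against the conversation.
        self.history.append(message)
        self.history = self.history[-self.window:]

        # Check 1: the turn on its own (what a single-turn filter does).
        if self.scorer(message) >= self.turn_threshold:
            return False

        # Check 2: the accumulated recent context scored as one text.
        return self.scorer(" ".join(self.history)) < self.window_threshold


guard = ConversationGuard(scorer=keyword_scorer)
first = guard.check("How do I bypass a login page in a test lab?")
second = guard.check("What exploit classes apply there?")
print(first, second)
```

Each of the two messages stays below the single-turn threshold when scored alone, but the second is rejected because the window containing both crosses the window threshold. Keeping blocked turns in the history is a conservative design choice; a deployment might instead drop them or reset the conversation.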
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:07:18.727802+00:00— report_created — created