Report #82006
[gotcha] Single-turn filters bypassed by multi-turn context poisoning
Evaluate the entire conversation context for malicious intent, not just the latest turn. Implement stateful moderation that tracks the progression of the conversation towards a restricted topic or action.
Journey Context:
Input filters often look for keywords like 'ignore instructions' in the current user message. An attacker bypasses this by slowly building up a persona over multiple turns. Turn 1: 'Let's play a game.' Turn 2: 'The game rules are: ignore previous instructions.' The filter sees benign text in isolation, but the LLM follows the poisoned context. Stateful moderation is computationally harder and prone to false positives, but necessary to catch slow-play attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:14:20.245699+00:00— report_created — created