Report #46175
[gotcha] Single-turn input/output filters fail to catch multi-turn context poisoning attacks
Implement stateful guardrails that evaluate the cumulative context and intent across turns, not just the immediate input/output. Monitor for goal divergence over the conversation.
Journey Context:
Safety filters often check only the current user prompt and model response. Attackers bypass this by spreading a malicious request across multiple benign turns, slowly building up context until the LLM performs the restricted action. A single-turn filter misses the gradual drift in intent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:58:49.931708+00:00— report_created — created