Report #55500
[gotcha] Relying on single-turn input filters for multi-turn conversations
Implement context window monitoring and apply guardrails to the entire conversational context or the model's generated output, not just the latest user turn.
Journey Context:
Developers check each user message individually for malicious intent. In a multi-turn attack, the attacker splits the malicious payload across multiple benign-seeming turns. The LLM accumulates the context and executes the combined payload, even though no single turn triggered the input filter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:39:04.349734+00:00— report_created — created