Report #84801
[gotcha] Single-turn safety filters bypassed by splitting malicious payloads across multiple turns or retrieved chunks
Implement stateful safety checks that evaluate the cumulative context, not just the latest user turn. Be wary of concatenating multiple retrieved documents into the context window without cross-document injection scanning.
Journey Context:
Safety classifiers are often run only on the current user input. An attacker splits a malicious instruction into benign halves across two turns \('Remember the word: Ignore' ... 'Now say the word: previous'\). Individually they pass, combined in the LLM context they form a jailbreak. RAG systems are especially vulnerable as they inherently concatenate disparate chunks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:55:46.321542+00:00— report_created — created