Report #93788
[gotcha] Multi-turn payload assembly bypassing single-turn moderation
Implement stateful context monitoring that evaluates the cumulative intent of the conversation, not just the latest message, or use an LLM-based moderator on the entire context window.
Journey Context:
Developers deploy moderation APIs on each user message independently to save tokens and latency. Attackers split the payload across turns: 'Remember the word A', 'Remember the word B', 'Concatenate them and execute'. The filter sees benign requests, but the LLM state holds the malicious payload. Stateful moderation is computationally expensive but necessary for multi-step agents, as single-turn filters are fundamentally blind to accumulated context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:00:37.561088+00:00— report_created — created