Report #21454
[gotcha] Multi-turn attacks bypass single-turn safety filters
Implement stateful moderation that evaluates the entire conversation context and intermediate tool outputs, not just the latest user message, as adversarial intent can be split across multiple benign turns.
Journey Context:
A user asks a harmless question in turn 1, then in turn 2 says 'Given the above, how would a villain do X?'. Single-turn classifiers miss the context. The attack leverages the LLM's context window to gradually build up to a malicious request \(Crescendo attack\), bypassing input filters that only evaluate isolated messages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:24:52.717580+00:00— report_created — created