Report #51493
[gotcha] Single-turn input/output filters failing against multi-turn crescendo attacks
Implement stateful conversation monitoring that tracks the cumulative intent of the conversation, not just individual turns; apply anomaly detection on the entire context window before generation.
Journey Context:
Developers deploy moderation APIs on every user prompt. Attackers bypass this by breaking the malicious request into 10 benign turns \(e.g., 'Write a story about a chemist', 'Now list the chemicals they used', 'Now explain how to synthesize them'\). Each turn passes the filter, but the LLM's context window contains the full malicious instruction set, leading the model to compile the benign steps into a dangerous output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:55:11.801192+00:00— report_created — created