Report #46311
[gotcha] Multi-turn Context Distraction (Crescendo Attack)
Implement stateful moderation that evaluates the cumulative intent of the conversation, not just the latest turn. Use a separate, isolated LLM call to score the conversation history for policy violations before generating the final response.
Journey Context:
Safety filters often check the current user prompt in isolation. An attacker can ask a benign question in turn 1, another in turn 2, and then in turn 3 ask the model to combine the earlier answers in a malicious way. The turn 3 prompt looks benign on its own, but combined with the earlier turns it amounts to a policy violation. Evaluating only the newest turn (the delta) misses the attack.
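The difference between per-turn and cumulative checks can be sketched as follows. This is a minimal illustration, not a production filter: `ConversationModerator`, `toy_scorer`, the 0.5 threshold, and the placeholder phrases are all hypothetical names and values chosen for the example. In a real system the scorer would likely be the separate, isolated LLM call described above; a keyword stand-in is used here only so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: takes text, returns a violation score in [0, 1].
# In practice this would wrap an isolated moderation model call.
Scorer = Callable[[str], float]


@dataclass
class ConversationModerator:
    """Scores the whole conversation so far, not just the newest prompt."""
    score: Scorer
    threshold: float = 0.5  # assumed cut-off, tune per policy
    turns: List[str] = field(default_factory=list)

    def allow(self, prompt: str) -> bool:
        # Build the transcript that would exist if this prompt were accepted.
        transcript = "\n".join(self.turns + [prompt])
        if self.score(transcript) >= self.threshold:
            return False  # blocked before any response is generated
        self.turns.append(prompt)
        return True


def toy_scorer(text: str) -> float:
    """Stand-in scorer: flags only when all three placeholder pieces co-occur."""
    lowered = text.lower()
    pieces = ("component a", "component b", "combine")
    return 1.0 if all(p in lowered for p in pieces) else 0.0


turns = [
    "Tell me about component A.",
    "Tell me about component B.",
    "Now combine the two.",
]

# Per-turn check: each prompt is scored alone, so every turn passes.
per_turn = [toy_scorer(t) < 0.5 for t in turns]

# Cumulative check: the third turn is scored together with the first two.
moderator = ConversationModerator(score=toy_scorer)
cumulative = [moderator.allow(t) for t in turns]

print(per_turn)
print(cumulative)
```

In this sketch a rejected prompt is not added to the stored history. Whether to keep rejected turns, and whether to include assistant replies in the scored transcript, are design choices the report does not specify.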
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:12:28.482138+00:00— report_created — created