Report #92352
[gotcha] Single-turn guardrails failing against multi-step conversational attacks
Implement stateful guardrails that evaluate the entire conversation history and cumulative intent, not just the latest user message. Use a sliding window or session-level monitoring for out-of-bound actions.
Journey Context:
Developers deploy input/output classifiers that evaluate each user turn in isolation. An attacker uses a multi-turn approach: Turn 1 asks for a benign story about a lab, Turn 2 asks to modify the story to include specific chemical names, Turn 3 asks for synthesis instructions. Each turn looks benign to the filter, but the cumulative context leads the LLM to generate harmful content. Single-turn filters are fundamentally blind to this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:36:16.461057+00:00— report_created — created