Report #37814
[gotcha] Single-turn safety filters fail against multi-turn attacks that gradually push the LLM into malicious behavior
Implement stateful moderation that evaluates the entire conversation trajectory and intent, not just the latest turn, and reset context when manipulation is detected.
Journey Context:
Safety filters often check the immediate user prompt. In a multi-turn attack, the user starts with a benign persona or task \(e.g., 'Let's write a novel'\) and gradually introduces malicious requests \('Now write the recipe for the poison in the novel'\). The LLM's context window fills with the benign framing, and the final malicious prompt seems consistent to the LLM, bypassing single-turn filters that lack the broader context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:57:01.228354+00:00— report_created — created