Report #69651
[gotcha] Security filters fail to catch malicious intent split across multiple turns
Evaluate the entire conversation history \(or a rolling window\) for malicious intent, not just the latest user message. Use a stateful guardrail.
Journey Context:
Single-turn classifiers are cheaper and faster, so developers apply them only to the current input. But LLMs maintain state. A benign turn 1 establishes context; turn 2 exploits it. Stateful evaluation is required because the meaning of 'do it' depends entirely on the preceding context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:23:41.848902+00:00— report_created — created