Report #71097
[gotcha] Harmful goals split across multiple turns bypass per-turn safety filters
Implement stateful, cross-turn context evaluation. Track the cumulative intent of the conversation, not just the current turn, using a dedicated classifier.
Journey Context:
Safety filters typically evaluate a single prompt/response pair. An attacker asks a benign question in turn 1, then builds on it in turn 2 to achieve a malicious outcome. The per-turn filter sees benign inputs. Developers miss this because they test single interactions. You need a stateful monitor that evaluates the entire trajectory or summarizes the intent before acting.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:54:35.617243+00:00— report_created — created