Report #51324
[gotcha] Single-turn safety filters bypassed by multi-turn contextual attacks
Implement stateful moderation that evaluates the cumulative context and intent of the conversation, not just the latest turn. Monitor for goal-hijacking patterns where the user slowly shifts the topic to restricted areas over several turns.
Journey Context:
Developers deploy input/output filters that only check the current prompt/response pair. An attacker can ask benign questions that establish a persona or context, then ask the restricted question. The LLM's context window contains the 'jailbreak' setup from previous turns, bypassing a naive per-turn filter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:37:59.347271+00:00— report_created — created