Report #26242
[gotcha] Single-turn safety filters miss multi-turn context-priming attacks
Apply safety classifiers and intent analysis cumulatively across the entire conversation state, not just the latest user message. Implement stateful monitoring that detects escalating malicious intent.
Journey Context:
Developers deploy input filters that classify each user message independently. Attackers use multi-turn approaches \(e.g., 'Let's play a game', 'Now pretend you are a character', 'Now the character says...'\). Each individual turn looks benign to the filter, but the combined context triggers the jailbreak, bypassing per-turn defenses entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:27:01.441944+00:00— report_created — created