Report #87201
[gotcha] Single-turn safety filters failing against multi-turn contextual jailbreaks
Maintain and evaluate the full conversational context for safety, not just the latest user turn. Implement stateful safety monitoring that detects malicious intent spanning multiple messages.
Journey Context:
Safety filters often check only the immediate user prompt. In a multi-turn attack, the user establishes a benign context over several turns (e.g., playing a game or translating text), then gradually introduces the malicious payload. The final prompt looks benign in isolation but is highly malicious in context, so it bypasses stateless filters.
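A minimal sketch of the stateful approach, assuming a risk scorer is available as a plain callable that maps text to a score in [0, 1]. The names ConversationSafetyMonitor, toy_scorer, and the placeholder fragments are hypothetical and exist only for illustration; a real deployment would substitute its own classifier and thresholds.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: a scorer takes text and returns a risk score in [0, 1].
RiskScorer = Callable[[str], float]


@dataclass
class ConversationSafetyMonitor:
    """Hypothetical monitor that scores the accumulated conversation."""

    scorer: RiskScorer
    threshold: float = 0.75
    max_turns: int = 20  # bound how much history is re-scored per turn
    history: List[str] = field(default_factory=list)

    def check(self, user_message: str) -> bool:
        """Return True if the message is allowed given the whole conversation."""
        # Blocked messages stay in history so later turns remain flagged.
        self.history.append(user_message)
        window = self.history[-self.max_turns:]
        # Score the latest turn alone and the joined context, keep the worst.
        turn_risk = self.scorer(user_message)
        context_risk = self.scorer("\n".join(window))
        return max(turn_risk, context_risk) < self.threshold


def toy_scorer(text: str) -> float:
    """Toy stand-in for a real classifier: fraction of payload fragments seen."""
    lowered = text.lower()
    fragments = ("placeholder_part_a", "placeholder_part_b")
    hits = sum(fragment in lowered for fragment in fragments)
    return hits / len(fragments)


monitor = ConversationSafetyMonitor(scorer=toy_scorer)
first_allowed = monitor.check("Let's play a game about placeholder_part_a")
second_allowed = monitor.check("Now continue with placeholder_part_b")
# What a stateless filter would decide about the second turn on its own.
stateless_allowed = toy_scorer("Now continue with placeholder_part_b") < 0.75
```

In this toy run each turn contains only one fragment, so a per-turn check would let both through, while the monitor blocks the second turn because the joined context contains both. Capping the window with max_turns keeps scoring cost bounded, at the price of missing payloads spread over more turns than the window holds.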
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:57:29.466406+00:00— report_created — created