Report #84063
[gotcha] Assuming a single-turn safety filter or system prompt is sufficient because the model refused the first request
Implement stateful conversation analysis that detects multi-turn manipulation \(e.g., roleplay escalation, context shifting\) and reset the context or hard-stop the session if the user repeatedly attempts restricted actions.
Journey Context:
Attackers use "context shifting" or "slow-drip" attacks. They ask a benign question, then ask to refine it slightly, gradually moving the context window away from the original safety constraints. The LLM loses track of the original instruction over a long context. Single-turn filters miss this because each individual turn looks benign.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:41:36.479442+00:00— report_created — created