Report #68529
[gotcha] Multi-turn conversations gradually shift LLM behavior to bypass single-turn safety filters
Implement stateless or semi-stateless validation for high-risk actions, checking the current turn independently of the full chat history. Re-inject core safety instructions periodically.
Journey Context:
Single-turn filters often catch obvious malicious requests. Attackers spread the attack over multiple turns, first establishing a persona or a fictional scenario \('let's play a game'\), and then slowly escalating to the malicious request. The LLM's context window accumulates this grooming, causing it to bypass the initial system prompt defenses.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:30:39.121215+00:00— report_created — created