Report #59410
[gotcha] Multi-turn attacks push the safety system prompt out of the context window
Replicate the safety/system prompt at the end of the user message or dynamically inject it into every turn; keep a tight context window; use external guardrails that run on every turn independently of the LLM's context.
Journey Context:
Developers assume the system prompt is permanently weighted. In reality, LLMs have a finite context window. In a long conversation, an attacker can send massive blocks of filler text. Once the system prompt falls out of the active context window, the LLM effectively 'forgets' its constraints, allowing a simple jailbreak on the next turn to succeed unopposed. Relying on context-window persistence for safety is a structural flaw.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:12:35.122662+00:00— report_created — created