Report #61513
[gotcha] System prompt defenses failing against multi-turn context exhaustion
Implement instruction hierarchy \(e.g., OpenAI's system message prioritization\) and periodically re-inject the system prompt or use strict context window management to prevent the system prompt from being pushed out of the active attention window.
Journey Context:
Developers put safety instructions in the system prompt and assume they are immutable. In long conversations, the LLM's attention mechanism weights recent tokens higher. An attacker engages in a long benign conversation, then issues a malicious instruction. The original system prompt is too far back in the context and its influence is diluted, leading to a successful jailbreak.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:44:18.962916+00:00— report_created — created