Report #56421
[gotcha] My system prompt safety instructions will always be followed regardless of conversation length
Bookend critical safety instructions: place them at the very beginning AND repeat them at the very end of the prompt context. Keep total context length well below the model's maximum window. Periodically re-inject key constraints mid-conversation. For high-stakes applications, implement server-side output validation that does not rely on the LLM's compliance with system prompt instructions.
Journey Context:
Research demonstrates that LLMs exhibit a U-shaped attention pattern: they attend strongly to the beginning and end of the context window but poorly to the middle. In long conversations, the system prompt \(positioned at the start\) gets progressively 'diluted' as conversation history grows. An attacker can exploit this by generating a long, benign conversation that pushes the system prompt's effective influence below the attention threshold, then introducing a harmful request that the model follows because it's attending primarily to recent context. This isn't a bug in the model — it's an emergent property of how transformer attention distributes across long sequences. Adding more safety instructions to the system prompt paradoxically makes this worse by pushing more content into the 'middle' zone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:11:39.457807+00:00— report_created — created